# Delayed Reinforcement Learning by Imitation

Pierre Liotet*¹, Davide Maran*¹, Lorenzo Bisi¹, Marcello Restelli¹

*Equal contribution. ¹Politecnico di Milano, Milan, Italy. Correspondence to: Pierre Liotet. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

When the agent's observations or interactions are delayed, classic reinforcement learning tools usually fail. In this paper, we propose a simple yet new and efficient solution to this problem. We assume that, in the undelayed environment, an efficient policy is known or can be easily learned, but the task may suffer from delays in practice and we thus want to take them into account. We present a novel algorithm, Delayed Imitation with Dataset Aggregation (DIDA), which builds upon imitation learning methods to learn how to act in a delayed environment from undelayed demonstrations. We provide a theoretical analysis of the approach that will guide the practical design of DIDA. These results are also of general interest in the delayed reinforcement learning literature, as they provide bounds on the performance gap between delayed and undelayed tasks under smoothness conditions. We show empirically that DIDA obtains high performance with remarkable sample efficiency on a variety of tasks, including robotic locomotion, classic control, and trading.

1. Introduction

In reinforcement learning (RL), it is generally assumed that the effect of an action over the environment is known instantaneously to the agent. However, in the presence of delays, this classic setting is challenged. The effect of a delayed action execution or state observation, if not accounted for, can have perilous effects in practice (Dulac-Arnold et al., 2019). It can induce a performance loss in trading (Wilcox, 1993), create instability in dynamic systems (Dugard & Verriest, 1998; Gu & Niculescu, 2003), and be detrimental to the training of real-world robots (Mahmood et al., 2018). To further grasp the importance of delay, one may notice that most traffic laws around the world base safety distances on drivers' reaction time, which is partly due to the perception of the event and partly to the implementation of the action (Droździel et al., 2020). These two types of delay have an exact correspondence in RL, where they are dubbed state observation and action execution delays. There are many ways in which delays can further vary. They may be anonymous (i.e., not known to the agent), constant or stochastic, integer or non-integer. In this work, as in most of the literature, we focus on constant non-anonymous delays in the action execution or, equivalently (Katsikopoulos & Engelbrecht, 2003), in the state observation. Previous research can be divided into three main directions. In memoryless approaches, the agent's policy depends on the last observed state (Schuitema et al., 2010). Augmented approaches try to cast the problem into a Markov decision process (MDP) by building policies based on an augmented state, composed of the last observed state and the actions that the agent knows it has taken since then (Bouteiller et al., 2020).
A last line of research, which we refer to as the model-based approach, considers using as policy input any statistic about the current unknown state that can be computed from the augmented state (Walsh et al., 2009; Firoiu et al., 2018; Chen et al., 2021; Agarwal & Aggarwal, 2021; Derman et al., 2021; Liotet et al., 2021). The goal of this approach is to avoid the curse of dimensionality posed by the augmented state (Walsh et al., 2009; Bouteiller et al., 2020). We formalize the problem of delays in Section 3 and give a more in-depth description of the literature in Section 4. We adopt the simple yet practically effective idea of learning a policy in a delayed environment by applying imitation learning to a policy learned in the undelayed environment, as described in Section 5. We propose to use DAGGER (Ross et al., 2011) as the imitation learning algorithm. It is particularly well suited to the task since, being one of the few algorithms to compute the loss under the learner's own distribution (Osa et al., 2018), it is able to account for the shift in distribution induced by the delay. We provide a theoretical analysis (Section 6) which, under smoothness conditions on the MDP, bounds the performance lost by introducing delays. Finally, we provide an extensive experimental analysis (Section 7) where our algorithm is compared with state-of-the-art approaches on a variety of delayed problems and demonstrates great performance and sample efficiency.

2. Preliminaries

Reinforcement Learning. A discrete-time discounted Markov Decision Process (MDP) (Puterman, 2014) is a 6-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, R, \gamma, \mu)$, where $\mathcal{S}$ and $\mathcal{A}$ are measurable sets of states and actions respectively, $p(s'|s,a)$ is the probability of transitioning to a state $s'$ departing from state $s$ and taking action $a$, and $R(s,a)$ is a random variable defining the reward collected during such a transition; we denote by $r(s,a)$ its expected value. Finally, $\mu$ is the initial state distribution. The agent's goal is to find a policy $\pi$, which assigns probabilities to the actions given a state, maximizing the expected discounted return with discount factor $\gamma \in [0,1)$, defined as¹

$$J(\pi) = \mathbb{E}_{\substack{s_0 \sim \mu,\; a_t \sim \pi(\cdot|s_t) \\ s_{t+1} \sim p(\cdot|s_t, a_t)}}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right].$$

¹In the sequel, we tacitly assume that $s_{t+1} \sim p(\cdot|s_t, a_t)$.

We consider an infinite-horizon setting, where $H = \infty$. Note that $\frac{1}{1-\gamma}$ can be seen as the effective horizon in this case. We restrict the set of policies to the stationary Markovian policies $\Pi$, as it contains the optimal one (Puterman, 2014). RL analysis frequently introduces the concept of state-action value function, which quantifies the expected return obtained under some policy, starting from a given state and fixing the first action. Formally, this function is defined as

$$Q^\pi(s,a) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s,\, a_0 = a\right].$$

Similarly, we define the state value function as $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^\pi(s,a)]$. Lastly, we consider the discounted visited state distribution under some policy $\pi$, starting from any initial distribution $\rho$, defined for a state $s \in \mathcal{S}$ as $d^\pi_\rho(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi, \rho)$.

Lipschitz MDPs. We now introduce notions that allow us to characterize the smoothness of an MDP. Let $L > 0$ and let $(\mathcal{X}, d_\mathcal{X})$ and $(\mathcal{Y}, d_\mathcal{Y})$ be two metric spaces. A function $f: \mathcal{X} \to \mathcal{Y}$ is said to be $L$-Lipschitz continuous ($L$-LC) if, $\forall x, x' \in \mathcal{X}$, $d_\mathcal{Y}(f(x), f(x')) \le L\, d_\mathcal{X}(x, x')$. We denote the Lipschitz semi-norm of a function $f$ as $\|f\|_L = \sup_{x, x' \in \mathcal{X},\, x \ne x'} \frac{d_\mathcal{Y}(f(x), f(x'))}{d_\mathcal{X}(x, x')}$. In a real space $\mathcal{X} \subset \mathbb{R}^n$, we use the Euclidean distance, i.e., $d_\mathcal{X}(x, x') = \|x - x'\|_2$.
As for probability distributions, we use the $L^1$-Wasserstein distance, which for two probability measures $\mu, \nu$ with sample space $\Omega$ is defined as (Villani, 2009):

$$W_1(\mu, \nu) = \sup_{\|f\|_L \le 1} \left| \int_\Omega f(\omega)\, (\mu - \nu)(\mathrm{d}\omega) \right|.$$

We now use these concepts to quantify the smoothness of an MDP.

Definition 2.1 (Lipschitz MDP). An MDP is said to be $(L_P, L_r)$-LC if, for all $(s,a), (s',a') \in \mathcal{S} \times \mathcal{A}$,
$$W_1\big(p(\cdot|s,a), p(\cdot|s',a')\big) \le L_P\, \big(d_\mathcal{S}(s,s') + d_\mathcal{A}(a,a')\big),$$
$$\big|r(s,a) - r(s',a')\big| \le L_r\, \big(d_\mathcal{S}(s,s') + d_\mathcal{A}(a,a')\big).$$

Definition 2.2 (Lipschitz policy). A stationary Markovian policy $\pi$ is said to be $L_\pi$-LC if, $\forall s, s' \in \mathcal{S}$,
$$W_1\big(\pi(\cdot|s), \pi(\cdot|s')\big) \le L_\pi\, d_\mathcal{S}(s,s').$$

These concepts provide useful tools for theoretical analysis and have been extensively used in the field of RL (Rachelson & Lagoudakis, 2010). Under the assumption of an $(L_P, L_r)$-LC MDP and an $L_\pi$-LC policy $\pi$, provided that $\gamma L_P (1 + L_\pi) < 1$, $Q^\pi$ is $L_Q$-LC with $L_Q = \frac{L_r}{1 - \gamma L_P (1 + L_\pi)}$ (Rachelson & Lagoudakis, 2010, Theorem 1). This property can be useful to prove the Lipschitzness of the $Q$ function. Additionally, in the case of delays, the smoothness of trajectories (sequences of consecutive states and actions) is a key factor. Intuitively, smoother trajectories make the current unknown state more predictable. Therefore, we consider the concept of time-Lipschitzness, introduced by Metelli et al. (2020).

Definition 2.3 (Time-Lipschitz MDP). An MDP is said to be $L_T$-Time-Lipschitz Continuous ($L_T$-TLC) if, $\forall (s,a) \in \mathcal{S} \times \mathcal{A}$,
$$W_1\big(p(\cdot|s,a), \delta_s\big) \le L_T,$$
where $\delta_s$ is the Dirac distribution with mass on $s$.

3. Problem Definition

A delayed MDP (DMDP) stems from an MDP endowed with a sequence of variables $(\Delta_t)_{t \in \mathbb{N}}$ corresponding to the delay at each step of the sequential process. The delay can affect the state observation, which implies that the agent has no access to the current state but only to a state visited $\Delta_t$ steps before. If it affects the action execution instead, the delay implies that the agent must select an action that will be executed $\Delta_t$ steps from now. Lastly, reward collection delays may raise credit assignment issues and are outside the scope of this paper. In the first two cases, the DMDP violates the Markov assumption, since the next observed state-reward couple does not depend only on the currently observable state and the chosen action. In the literature, the delay is usually assumed to be a Markovian process, that is, $\Delta_t \sim \Pr(\cdot \mid \Delta_{t-1}, s_{t-1}, a_{t-1})$. Note that this definition includes state-dependent delays when $\Delta_t \sim \Pr(\cdot \mid s_{t-1})$, Markov-chain delays when $\Delta_t \sim \Pr(\cdot \mid \Delta_{t-1})$, and stochastic delays when the $(\Delta_t)_{t \in \mathbb{N}}$ are i.i.d.

In this work we consider constant delays, denoting the delay by the symbol $\Delta$. When the delay is constant, the action execution delay and the state observation one are equivalent (Katsikopoulos & Engelbrecht, 2003); thus, it is sufficient to consider only the state observation delay. Furthermore, following Katsikopoulos & Engelbrecht (2003), we consider a reward collection delay equal to the state observation delay, so as not to collect a reward on a yet unobserved state, which could result in some form of partial state information. Finally, we assume that the delay is known to the agent, placing ourselves in the non-anonymous delay framework.
Within this reduced framework, it is possible to introduce an important concept of DMDPs: the augmented state. Given the last observed state $s$ and the sequence of actions $(a_1, \ldots, a_\Delta)$ which have been taken since then, but whose outcome has not yet been observed, the agent can construct an augmented state, i.e., a new state in $\mathcal{X} = \mathcal{S} \times \mathcal{A}^\Delta$, which casts the DMDP into an MDP (Bertsekas, 1987; Altman & Nain, 1992). In other words, the augmented state contains all the information the agent needs to learn the optimal policy in the DMDP. From the augmented state, we can gather information on the current state. This information can be summarized by the belief $b(s|x)$, the probability distribution of the current unknown state $s$ given the augmented state $x$. More explicitly, given $x = (s_1, a_1, \ldots, a_\Delta)$, one has

$$b(s|x) = \int_{\mathcal{S}^{\Delta-1}} p(s|s_\Delta, a_\Delta) \prod_{i=2}^{\Delta} p(s_i|s_{i-1}, a_{i-1})\, \mathrm{d}s_i.$$

The delayed reward collected for playing action $a$ on $x$ is given by $\tilde r(x, a) = \mathbb{E}_{s \sim b(\cdot|x)}[r(s, a)]$. To complete the DMDP framework, we define $\tilde\mu$, the initial augmented state distribution. It samples the state contained in the augmented state under $\mu$ and samples the first action sequence under a distribution whose choice depends on the environment; we consider a uniform distribution on $\mathcal{A}^\Delta$.

4. Related Works

The first proposed solution to the problem of delays is to use regular RL algorithms on the augmented-state MDP. Although the optimal delayed policy could potentially be obtained, this approach is affected by the exponential growth of the augmented state space, which becomes $|\mathcal{S}||\mathcal{A}|^\Delta$ (Walsh et al., 2009), and can be harmful to performance in practice. Nonetheless, recent work by Bouteiller et al. (2020) revisits this approach and proposes an ingenious way to resample trajectories without interacting with the environment by populating the augmented state with actions from a different policy, greatly improving the sample efficiency. They propose an algorithm, Delay-Correcting Actor-Critic (DCAC), which builds on SAC (Haarnoja et al., 2018) using the aforementioned resampling idea. DCAC has great experimental results and is sample efficient by design. A second line of research focuses on memoryless policies, inspired by the partially observable MDP literature. It ignores the action queue and acts according to the last observed state only. However, the delay can still be taken into account, as in dSARSA (Schuitema et al., 2010), a modified version of SARSA (Sutton & Barto, 2018) which accounts for the delay during its update. Indeed, SARSA would credit the reward collected for applying action $a$ on the augmented state $x$, containing the last observed state $s$, to the pair $(s, a)$. Instead, dSARSA proposes to credit $(s, a')$, where $a'$ is the oldest action stored in $x$, i.e., the action actually applied on $s$. Despite being memoryless, dSARSA often achieves good performance in practice. Finally, the most common line of research, the model-based approach, relies on computing statistics on the current state which are then used to select an action. The name model-based comes from the fact that those solutions usually learn a model of the environment to predict the current state, by simulating the effect of the actions stored in the augmented state on the last observed state. Walsh et al. (2009) learn the transition as a deterministic mapping so as to predict the most probable state, before selecting actions based on it. Derman et al. (2021) and Firoiu et al. (2018) propose a similar approach by learning the transitions with feed-forward and recurrent neural networks, respectively.
Agarwal & Aggarwal (2021) estimate the transition probabilities and the undelayed $Q$ function to select the action that maximizes $Q$ under the estimated distribution of the current state. Chen et al. (2021) use a particle-based approach to produce potential outcomes for the current state and, interestingly, extend the predictions to collect better value estimates. Liotet et al. (2021) propose D-TRPO, which learns a vectorial encoding of the belief of the current state that is then used as an input to the policy, the latter being trained with TRPO (Schulman et al., 2015a). The authors also propose another algorithm, L2-TRPO, which learns the expected current state instead of the belief. While most of these works assume that the delay is fixed, some consider the problem of stochastic delays (Bouteiller et al., 2020; Derman et al., 2021; Agarwal & Aggarwal, 2021). Only one of them considers the case of non-integer delays (Schuitema et al., 2010).

5. Imitation Learning for Delays

Our proposed approach is motivated by the limitations of two lines of research from the literature. Augmented approaches are affected by the increased dimensionality of the augmented state space, which hinders the learning process, while model-based approaches require carefully designed models of the state transitions and usually come with a computational burden. Instead, we propose to learn a mapping from the augmented state directly to undelayed expert actions, facilitating the learning process compared to augmented approaches and removing the explicit approximation of transitions required by model-based approaches. It implies, however, that training is split into two sub-problems: learning an expert undelayed policy and then imitating this policy in a DMDP.

5.1. Imitation Learning

It is usually easier to learn a behavior from demonstrations than to learn it from scratch using standard RL techniques. Imitation learning aims at learning a policy by mimicking the actions of an expert, bridging the gap between RL and supervised learning. Obviously, it requires that one can collect examples of an expert's behavior to learn from. For an expert policy $\pi_E$, most imitation learning approaches aim at finding a policy $\pi$ that minimizes $\mathbb{E}_{s \sim d^{\pi_E}_\mu}[l(s, \pi)]$ (Ross et al., 2011), where $l(s, \pi)$ is a loss designed to make $\pi$ closer to $\pi_E$. Note that this objective is defined under the state distribution induced by $\pi_E$. This can easily be problematic: whenever the learner makes an error, it could end up in a state where its knowledge of the expert's behavior is poor, and errors could therefore accumulate. Indeed, it has been shown (Xu et al., 2020, Theorem 1) that the error made by the learner potentially propagates as the square of the effective horizon. This is consistent with other bounds found in the literature, which depend on $H^2$ in the finite-horizon setting (Ross & Bagnell, 2010, Theorem 2.1). One successful solution to this problem is dataset aggregation, as proposed by Ross et al. (2011) in their DAGGER algorithm. The idea is to sample new data under the learned policy and query the expert on those new samples in order to match the learner's state distribution. DAGGER recursively builds a dataset $D$ by sampling trajectories under a policy $\pi_i = \beta_i \pi_E + (1-\beta_i)\hat\pi_i$, obtained as a $\beta_i$-weighted mixture of the expert policy and the previously imitated policy $\hat\pi_i$. One then queries the expert's policy on the states encountered in these trajectories and adds the tuples $(s, \pi_E(s))$ to $D$.
Finally, a new imitated policy $\hat\pi_{i+1}$ is trained on $D$. The sequence $(\beta_i)_{i \in \{1,\ldots,N\}}$ is such that $\beta_1 = 1$, so as to sample only from $\pi_E$ initially, and $\beta_N = 0$, so as to sample only from the imitated policy in the end.

5.2. Duality of Trajectories

Once sampled, either from a DMDP or from its underlying MDP, trajectories can be interpreted in either one of the two processes when the delay is an integer number of steps. In a DMDP, the current state will eventually be observed by the delayed agent. In an MDP, the trajectories can be reorganized to simulate the effect of a delay. In particular, this means that one can collect trajectories in an undelayed environment and sample either from an undelayed policy or from a delayed policy (by creating a synthetic augmented state). This is exactly what is required to adapt DAGGER to imitate an undelayed expert with a delayed learner.

Algorithm 1: Delayed Imitation with DAGGER (DIDA)
Inputs: (un)delayed environment $E$, undelayed expert $\pi_E$, $\beta$-routine, number of steps $N$, empty dataset $D$.
Output: delayed policy $\pi_I$.
1: for $\beta_i$ in the $\beta$-routine do
2:   for $j$ in $\{1, \ldots, N\}$ do
3:     if new episode then
4:       Initialize the state buffer $(s_1, s_2, \ldots, s_{\Delta+1})$ and the action buffer $(a_1, a_2, \ldots, a_\Delta)$
5:     end if
6:     Sample $a_E \sim \pi_E(\cdot|s_{\Delta+1})$ and set $a = a_E$
7:     if $u \sim \mathcal{U}([0,1])$ satisfies $u \ge \beta_i$ then
8:       Overwrite $a \sim \pi_I(\cdot|[s_1, a_1, \ldots, a_\Delta])$
9:     end if
10:    Aggregate the dataset: $D \leftarrow D \cup \{([s_1, a_1, \ldots, a_\Delta], a_E)\}$
11:    Apply $a$ in $E$ and observe the new state $s$
12:    Update the buffers: $(s_1, \ldots, s_{\Delta+1}) \leftarrow (s_2, \ldots, s_{\Delta+1}, s)$ and $(a_1, \ldots, a_\Delta) \leftarrow (a_2, \ldots, a_\Delta, a)$
13:   end for
14:   Train $\pi_I$ on $D$
15: end for

5.3. Imitating an Undelayed Policy

We follow the learning scheme of DAGGER, with the slight difference that, if the expert is queried, the current state is fed to $\pi_E$, while if the imitator policy is queried, an augmented state is built from the past samples, considering the state $\Delta$ steps before the current one and the sequence of actions taken since then. This implies that a buffer of the latest states and actions has to be maintained. We present our approach, which we call Delayed Imitation with DAGGER (DIDA), in Algorithm 1. In practice, there is no need to store each augmented state in the dataset $D$, but only trajectories of state-action pairs from the underlying MDP, from which the augmented states can be rebuilt during training. Otherwise, each action would be repeated $\Delta$ times in $D$, since an action is contained in $\Delta$ consecutive augmented states. When sampling from the undelayed environment is prohibited, we apply a slight modification to DIDA, as follows. Whenever the undelayed policy should be queried, one can instead sample from it in a memoryless fashion, that is, taking the last observed state contained in the augmented state as input to $\pi_E$, thus substituting Line 6 of Algorithm 1 with $a_E \sim \pi_E(\cdot|s_1)$. The action stored in the buffer must however remain the undelayed expert's action on the current unobserved state. This implies that the agent must wait $\Delta$ steps, until the current state is observed, before updating the buffer for the current augmented state, thus delaying the execution of Line 10 of Algorithm 1. We call this variation memoryless-DIDA (M-DIDA).
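To make Algorithm 1 concrete, the following is a minimal Python sketch of its main loop. It is an illustration under assumed interfaces, not the authors' implementation: `env` is an undelayed gymnasium-style environment, `expert.act(s)` returns the undelayed expert's action, and `imitator.act(x)` / `imitator.fit(X, y)` are a hypothetical learner API; continuous (array-valued) observations and actions are assumed.

```python
# Minimal sketch of the DIDA loop of Algorithm 1 (illustrative, not the authors' code).
import random
from collections import deque
import numpy as np

def init_buffers(env, expert, delay):
    """Reset the environment and warm up the state/action buffers with expert actions."""
    obs, _ = env.reset()
    states, actions = deque(maxlen=delay + 1), deque(maxlen=delay)
    states.append(obs)
    for _ in range(delay):
        a = expert.act(states[-1])
        obs, _, _, _, _ = env.step(a)
        states.append(obs)
        actions.append(a)
    return states, actions

def dida(env, expert, imitator, delay, betas, steps_per_iter):
    dataset = []                                   # pairs (augmented state, expert action)
    for beta in betas:                             # beta-routine, e.g. [1.0, 0.0, 0.0, ...]
        states, actions = init_buffers(env, expert, delay)
        for _ in range(steps_per_iter):
            a_expert = expert.act(states[-1])      # expert queried on the current state (Line 6)
            x = np.concatenate([states[0], *actions])  # synthetic augmented state (s_1, a_1..a_delay)
            dataset.append((x, a_expert))          # dataset aggregation (Line 10)
            # beta-mixture sampling: expert with prob. beta, imitator otherwise (Lines 7-9)
            a = a_expert if random.random() < beta else imitator.act(x)
            obs, _, terminated, truncated, _ = env.step(a)
            states.append(obs)                     # deques with maxlen shift automatically (Line 12)
            actions.append(a)
            if terminated or truncated:
                states, actions = init_buffers(env, expert, delay)
        imitator.fit([xi for xi, _ in dataset], [ai for _, ai in dataset])  # Line 14
    return imitator
```

With the β-routine used later in the paper (β₁ = 1 and βᵢ = 0 for i ≥ 2), only the first iteration gathers data by acting with the expert; all later iterations act with the current imitated policy while still labeling every visited augmented state with the expert's action.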
What will be the policy learned by DIDA in practice? Given an augmented state $x$, DIDA learns to replicate the action taken by the expert on the current state $s$, which is unknown to the agent. However, the same augmented state can lead to different current states, as summarized by the belief $b(s|x)$. Therefore, DIDA learns the following policy:

$$\tilde\pi_b(a|x) = \int_{\mathcal{S}} b(s|x)\, \pi_E(a|s)\, \mathrm{d}s. \qquad (3)$$

This policy is similar to the policies of model-based approaches and, for this reason, may yield sub-optimal policies in some MDPs (Liotet et al., 2021, Proposition VI.1). In practice, the class of functions of the imitated policy $\pi_I$ and the loss chosen for the training step of Algorithm 1 (Line 14) may slightly modify the policy learned by DIDA. For instance, a deterministic $\pi_I$ would naturally be unable to represent the distribution given in Equation (3). This is discussed in Appendix E.1.

5.4. Extension to Non-Integer Delays

We now suppose that the delay is non-integer, yet still constant. For simplicity, we assume $\Delta \in (0,1)$, but the general case follows from similar considerations. We consider a $\Delta$-delay in the action execution (the case of state observation is similar). A DMDP with non-integer delay can be viewed as the result of two interleaved MDPs, $\mathcal{M}$ with time indexes $t, t+1, \ldots$ and $\mathcal{M}_\Delta$ with indexes $t+\Delta, t+1+\Delta, \ldots$. These two discrete MDPs stem from a single continuous process, of which we observe only some fixed time steps, similarly to (Sutton et al., 1999). They share the same transition and reward functions. A delayed agent observes states from $\mathcal{M}$ while executing actions on states of $\mathcal{M}_\Delta$. In practice, the agent taking action $a_t$ when seeing state $s_t$ collects the reward $r(s_{t+\Delta}, a_t)$. The transition probabilities are also affected. We define $b_\Delta(s_{t+\Delta}|s_t, a)$ as the probability of reaching $s_{t+\Delta}$ from $s_t$ when action $a$ is applied during a time $\Delta$, and $b_{1-\Delta}(s_t|s_{t-1+\Delta}, a)$ as the probability of reaching $s_t$ from $s_{t-1+\Delta}$ when action $a$ is applied during a time $1-\Delta$. To make the definition consistent with the regular MDP, those probabilities must satisfy, for all $(s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$,

$$p(s'|s,a) = \int_\mathcal{S} b_{1-\Delta}(s'|z, a)\, b_\Delta(z|s, a)\, \mathrm{d}z. \qquad (4)$$

Clearly, even for $\Delta \in (0,1)$, an augmented state $x_t = (s_t, a_{t-1}) \in \mathcal{S} \times \mathcal{A} =: \mathcal{X}$ is needed in order not to lose information about the state $s_{t+\Delta} \sim b_\Delta(\cdot|s_t, a_{t-1})$. The DMDP with augmented state can again be cast into an MDP as in Bertsekas (1987); Altman & Nain (1992), where the new transition is defined, for $x_t = (s_t, a_{t-1}),\, x_{t+1} = (s_{t+1}, a_t) \in \mathcal{X}$, as

$$\tilde p(x_{t+1}|x_t, a) := \delta_a(a_t) \int_\mathcal{S} b_{1-\Delta}(s_{t+1}|z, a)\, b_\Delta(z|s_t, a_{t-1})\, \mathrm{d}z,$$

where the term $\delta_a(a_t)$ ensures that the new augmented state contains the action that has been applied on $x_t$. For delays greater than 1, one needs to consider the augmented state in the space $\mathcal{S} \times \mathcal{A}^{\lceil\Delta\rceil}$, and the previous considerations hold by first considering the integer part of the delay and then its remaining non-integer part. In this setting, we propose to use DIDA by learning an undelayed policy in $\mathcal{M}_\Delta$ and imitating it by building an augmented state from the states in $\mathcal{M}$. For clarity, we provide the extension of DIDA to this setting in Appendix D. To extend the theoretical analysis to the non-integer setting, we need to extend the concept of $L_T$-TLC: for $\Delta \in (0,1)$, a DMDP is $L_T$-TLC if, $\forall (s,a) \in \mathcal{S} \times \mathcal{A}$, $W_1(b_\Delta(\cdot|s,a), \delta_s) \le \Delta L_T$.
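For clarity, the policy that DIDA asymptotically learns in this non-integer setting is simply the analogue of Equation (3) with the fractional-step belief $b_\Delta$ (this is Equation (20) of Appendix B):

$$\tilde\pi_b(a|x) = \int_{\mathcal{S}} b_\Delta(s|x)\, \pi_E(a|s)\, \mathrm{d}s.$$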
6. Theoretical Analysis of the Approach

We now provide a theoretical analysis of the approach proposed above. The role of this analysis is twofold. First, it gives insights into which undelayed expert policy is best suited to be imitated in a DMDP. Secondly, it provides general results on the value function bounds between DMDPs and MDPs when the latter have smoothness guarantees, setting aside pathological counterexamples such as (Liotet et al., 2021, Proposition VI.1) while remaining realistic. Comparing the performance of delayed and undelayed policies via their state value functions is not trivial, since they live on two different spaces ($\mathcal{S}$ and $\mathcal{X} = \mathcal{S} \times \mathcal{A}^\Delta$). Different approaches have been proposed to address this issue. In (Walsh et al., 2009, Theorem 3), the idea stems from the fact that a delayed policy in a deterministic MDP has the same optimal performance as the undelayed one. Therefore, when a finite MDP is not deterministic but mildly stochastic, that is, when there exists $\epsilon$ such that $\forall (s,a) \in \mathcal{S} \times \mathcal{A},\ \exists s',\ p(s'|s,a) \ge 1-\epsilon$, one can use a deterministic approximation of the MDP and use it to learn a delayed policy. The distance between the value function $\tilde V^\pi$ of this delayed policy and the value function $V^\pi$ of the real MDP is bounded as

$$\left\| \tilde V^\pi - V^\pi \right\|_\infty \le \frac{\gamma \epsilon R_{\max}}{(1-\gamma)^2},$$

where $R_{\max}$ is the maximum absolute reward. The assumptions of Walsh et al. (2009) are quite strong, and the bound grows quadratically with the effective time horizon. Another approach is proposed by Agarwal & Aggarwal (2021, Theorem 1), who compare the delayed value function $V^{\tilde\pi}$ to $\mathbb{E}_{s \sim b(\cdot|x)}[V^\pi(s)]$, which corresponds to the expected value function of the undelayed policy averaged over the current unknown state given some augmented state. However, the authors make no smoothness assumptions. Instead, we base our analysis on smoothness assumptions to provide our main result on the difference in performance between delayed and undelayed policies in Theorem 6.1. To obtain this result, we must first derive a delayed version of the performance difference lemma (Kakade & Langford, 2002). Its proof, as for all other results in this section, is given in Appendix B and applies to any couple of delayed and undelayed policies. Note that these results hold for both integer and non-integer constant delays. For simplicity, we state the results with the belief $b$, but $b_\Delta$ is intended if the delay is non-integer.

Lemma 6.1 (Delayed Performance Difference Lemma). Consider an undelayed policy $\pi_E$ and a $\Delta$-delayed policy $\tilde\pi$, with $\Delta \in \mathbb{R}_{\ge 0}$. Then, for any $x \in \mathcal{X}$,

$$\mathbb{E}_{s \sim b(\cdot|x)}\left[V^{\pi_E}(s)\right] - V^{\tilde\pi}(x) = \frac{1}{1-\gamma}\, \mathbb{E}_{x' \sim d^{\tilde\pi}_x}\left[ \mathbb{E}_{s \sim b(\cdot|x')}\left[V^{\pi_E}(s)\right] - \mathbb{E}_{\substack{s \sim b(\cdot|x') \\ a \sim \tilde\pi(\cdot|x')}}\left[Q^{\pi_E}(s,a)\right] \right].$$

We can then leverage the previous result to obtain a valuable result for DMDPs, which holds for delayed policies of the form of Equation (3).

Theorem 6.1. Consider an $(L_P, L_r)$-LC MDP and an $L_\pi$-LC undelayed policy $\pi_E$, such that $Q^{\pi_E}$ is $L_Q$-LC.² Let $\tilde\pi_b$ be the $\Delta$-delayed policy defined as in Equation (3), with $\Delta \in \mathbb{R}_{\ge 0}$. Then, for any $x \in \mathcal{X}$,

$$\mathbb{E}_{s \sim b(\cdot|x)}\left[V^{\pi_E}(s)\right] - V^{\tilde\pi_b}(x) \le \frac{L_Q L_\pi}{1-\gamma}\, \sigma^x_b, \quad \text{where } \sigma^x_b = \mathbb{E}_{\substack{x' \sim d^{\tilde\pi_b}_x \\ s, s' \sim b(\cdot|x')}}\left[d_\mathcal{S}(s, s')\right].$$

²In fact, only Lipschitzness in the second argument is necessary (see proof).

Proof sketch. The first step of the proof applies the delayed performance difference lemma to the left-hand side. This yields an upper bound depending on $Q^{\pi_E}(s, a_1) - Q^{\pi_E}(s, a_2)$, where $a_1 \sim \pi_E(\cdot|s)$ and $a_2 \sim \tilde\pi_b(\cdot|x')$. It quantifies the effect of selecting an action under $\tilde\pi_b$ and then following $\pi_E$, instead of following $\pi_E$ directly. Using the Lipschitzness of $Q^{\pi_E}$, this term can be bounded by a quantity involving $W_1(\pi_E(\cdot|s), \tilde\pi_b(\cdot|x'))$. From the definition of $\tilde\pi_b(\cdot|x')$, which corresponds to applying $\pi_E$ to a state sampled from the belief $b(\cdot|x')$, an upper bound on this Wasserstein distance can be obtained, involving the integral of $b(s'|x')\, W_1(\pi_E(\cdot|s), \pi_E(\cdot|s'))$ over $s'$. This is precisely where the loss in performance of DIDA with respect to the expert might come from: instead of knowing the current state $s$, it is constrained to select an action that performs well on the possible current states, weighted by their belief given the augmented state. Using the Lipschitzness of $\pi_E$ completes the proof.
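Under the assumptions of Theorem 6.1, the chain of bounds sketched above can be condensed as follows (a restatement of the steps detailed in Appendix B), writing $\mathbb{E}_{x'}$ for the expectation under $d^{\tilde\pi_b}_x$:

$$
\mathbb{E}_{s \sim b(\cdot|x)}\big[V^{\pi_E}(s)\big] - V^{\tilde\pi_b}(x)
\;\le\; \frac{L_Q}{1-\gamma}\, \mathbb{E}_{x'}\Big[\mathbb{E}_{s \sim b(\cdot|x')}\big[W_1\big(\pi_E(\cdot|s), \tilde\pi_b(\cdot|x')\big)\big]\Big]
\;\le\; \frac{L_Q L_\pi}{1-\gamma}\, \mathbb{E}_{x'}\Big[\mathbb{E}_{s, s' \sim b(\cdot|x')}\big[d_{\mathcal{S}}(s, s')\big]\Big]
\;=\; \frac{L_Q L_\pi}{1-\gamma}\, \sigma^x_b.
$$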
However, this result may seem difficult to grasp because of its dependence on the term $\sigma^x_b$. We suggest two ways to further bound this term. The first relies on the time-Lipschitzness assumption on the MDP and yields Corollary 6.1.

Corollary 6.1. Under the assumptions of Theorem 6.1, adding that the MDP is $L_T$-TLC, for any $x \in \mathcal{X}$,

$$\mathbb{E}_{s \sim b(\cdot|x)}\left[V^{\pi_E}(s)\right] - V^{\tilde\pi_b}(x) \le \frac{2 \Delta L_T L_Q L_\pi}{1-\gamma}.$$

This first result clearly highlights the linear dependence on the delay $\Delta$. However, this bound does not vanish when the MDP is deterministic, as one would expect. This property is instead verified by a second result, which assumes a state space in $\mathbb{R}^n$ equipped with the Euclidean norm and yields Corollary 6.2.

Corollary 6.2. Under the assumptions of Theorem 6.1, adding that $\mathcal{S} \subset \mathbb{R}^n$ is equipped with the Euclidean norm, for any $x \in \mathcal{X}$,

$$\mathbb{E}_{s \sim b(\cdot|x)}\left[V^{\pi_E}(s)\right] - V^{\tilde\pi_b}(x) \le \frac{\sqrt{2}\, L_Q L_\pi}{1-\gamma}\, \mathbb{E}_{x' \sim d^{\tilde\pi_b}_x}\left[\sqrt{\operatorname{Var}_{s \sim b(\cdot|x')}(s \mid x')}\right].$$

Interestingly, we show that this second corollary matches a theoretical lower bound when the expert policy is optimal. We provide this lower bound in Theorem 6.2, which shows that a too irregular expert policy (with a high Lipschitz constant) provides weaker guarantees.

Theorem 6.2. For every $L_\pi > 0$, $L_Q > 0$, there exists an MDP such that the optimal policy is $L_\pi$-LC and its state-action value function is $L_Q$-LC in the second argument, but, for any $\Delta$-delayed policy $\tilde\pi$, with $\Delta \in \mathbb{R}_{\ge 0}$, and any $x \in \mathcal{X}$,

$$\mathbb{E}_{s \sim b(\cdot|x)}\left[V^\star(s)\right] - V^{\tilde\pi}(x) \ge \sqrt{\frac{2}{\pi}}\, \frac{L_Q L_\pi}{1-\gamma}\, \mathbb{E}_{x' \sim d^{\tilde\pi}_x}\left[\sqrt{\operatorname{Var}_{s \sim b(\cdot|x')}(s \mid x')}\right],$$

where $V^\star$ is the value function of the optimal undelayed policy.

Proof sketch. We build an MDP where such a bound holds. We consider state and action spaces in $\mathbb{R}$. Performing action $a$, we transition from $s$ to $s + a + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The reward function is proportional to $-|s + a|$, so that the objective is to always move towards the origin. In the undelayed case, knowing the current position, it is possible to achieve the optimal return of $0$. Instead, when there is an observation delay, we know the state only up to an error distributed as $\mathcal{N}(0, \Delta\sigma^2)$, which makes it impossible to obtain an expected per-step reward greater than $-\mathbb{E}[|\mathcal{N}(0, \Delta\sigma^2)|]$.

We provide an alternative way to derive performance bounds in Appendix C, which gives slightly different results, as discussed in Appendix C.1. We have bounded the performance of our perfectly imitated delayed policy $\tilde\pi_b$ with respect to the undelayed expert $\pi_E$. However, two additional sources of performance loss have to be taken into account. First, the expert $\pi_E$ may be suboptimal in the undelayed MDP. Second, the imitated policy $\pi_I$ may not learn $\tilde\pi_b$ exactly. These theoretical results highlight two important trade-offs in practice. If the expert policy is smoother than the optimal undelayed policy, then we might miss out on some opportunities, but the delayed policy is likely to be more similar to the expert one, according to Theorem 6.1. The second trade-off concerns noisier policies. For them, the imitation step is likely to be easier, as they provide examples of how to recover from bad decisions (Laskey et al., 2017). Therefore, our imitated policy $\pi_I$ is likely to be more similar to $\tilde\pi_b$. However, this may decrease the performance of the expert compared to the optimal undelayed policy.

7. Experiments

7.1. Setting

As we have seen in the theoretical analysis, a smoother expert is beneficial for the performance bound of the imitated delayed policy. Therefore, in the following experiments, we consider expert policies learned with SAC.
As reported in an extensive study on smooth policies (Mysore et al., 2021), the entropy-maximization framework of SAC is able to learn a smooth policy even without additional forms of regularization. To avoid the ever-growing memory required by storing all samples in the dataset, as done in Algorithm 1, we use a maximum buffer size of 10 iterations for DIDA and overwrite the oldest iteration's samples when this buffer is full. As suggested by Ross et al. (2011), we use β₁ = 1, βᵢ = 0 for i ≥ 2 as mixture weights for the sampling policy. The policy for DIDA is a simple feed-forward neural network. More details and all hyper-parameters are reported in Appendix E.2. We test DIDA and M-DIDA, along with some baselines from the state of the art, on the following environments.

Pendulum. The task of the agent is to rotate a pendulum upward. It is a classic experiment in delayed RL, as delays highly impact performance due to the unstable equilibrium in the upward position. We use the version from the library gym (Brockman et al., 2016).

Mujoco. Continuous robotic locomotion control tasks realized with an advanced physics simulator from the library mujoco (Todorov et al., 2012). Here the main difficulty lies in the complex dynamics and in the large state and action spaces. Among the possible environments, we consider the ones that are most affected by delays, namely Walker2d, Half Cheetah, Reacher, and Swimmer.

Trading. The agent trades the EUR-USD (€/$) currency pair every 10 minutes and can either buy, sell, or stay flat against a fixed amount of USD, following the framework of Bisi et al. (2020) and Riva et al. (2021). We assume trading is without fees, but we do take the spread into account. To this setting, we add a delay of 50 seconds to the action execution, thus placing the experiment in the non-integer delay framework. In this environment, we leverage the knowledge of an expert, which is a policy trained on years 2016-2017 by Fitted Q-Iteration (FQI; Ernst et al., 2005) with XGBoost (Chen & Guestrin, 2016) as a regressor for the Q function. For this task only, we use Extra Trees (Geurts et al., 2006) as the policy class for DIDA. Due to the highly stochastic nature of the environment, we consider 4 experts trained with the same configuration but different seeds. For each of these experts, 5 seeds have been used for DIDA, resulting in 20 runs of DIDA, on which the mean and standard deviation of the results are computed. Finally, since the expert has been selected by validating its hyper-parameters on 2018, it is likewise possible to perform validation on the delayed dataset of 2018 in order to select an expert which, albeit trained on undelayed data, performs well on delayed data. We refer to this expert as the delayed expert. We compare the performance of the delayed expert, the undelayed expert, and DIDA on the year 2019.

The baselines for comparison with our algorithm include a memoryless and an augmented version of TRPO (M-TRPO and A-TRPO, respectively), D-TRPO and L2-TRPO (Liotet et al., 2021), SARSA (Sutton & Barto, 2018), and dSARSA (Schuitema et al., 2010). The last two algorithms involve state discretization and are thus tested on Pendulum only. We also consider augmented SAC (A-SAC), also considered by Bouteiller et al. (2020), and memoryless SAC (M-SAC). Although SAC can be trained at every step, as we do for Pendulum, we restrict training to every 50 steps on Mujoco to speed up the procedure and reduce memory usage.
We also considered adding DCAC (Bouteiller et al., 2020) but, for computational reasons, we decided not to include it: early experimental results showed that its running time was more than 50 times that of DIDA. For a fair comparison with the baselines, which learn a policy from scratch, we include the training steps of the expert in the step count of DIDA, as indicated by the vertical dotted line in the figures.

Figure 1 ((a) Pendulum, (b) Half Cheetah, (c) Reacher): for a 5-step delay, mean return and one standard deviation (shaded) as a function of the number of steps sampled from the environment (10 seeds).

Figure 2 ((a) Swimmer, (b) Walker2d): for a 5-step delay, mean return and one standard deviation (shaded) as a function of the number of steps sampled from the environment (10 seeds).

Figure 3: mean return and its standard deviation (shaded) as a function of the delay (10 seeds).

7.2. Results

As we can see from the results on Pendulum and Mujoco (Figures 1a to 1c, 2a and 2b), DIDA converges much faster than the baselines in every environment, with the exception of A-SAC on the Pendulum environment. In less than half a million steps on Mujoco, and 250,000 steps on Pendulum, DIDA almost reaches its final performance. In Half Cheetah and Reacher, we note that, although it is the best delayed algorithm, DIDA performs much worse than the expert. Surprisingly, in Swimmer, DIDA performs slightly better than the undelayed expert. All these phenomena might actually come from a single cause. In our implementation of a delayed environment, we start from an undelayed environment to which we add a wrapper simulating the effect of the delay. The initialization of this process involves sampling a sequence of random actions that are applied in the undelayed environment in order to sample the first augmented state. Depending on the environment, this random sequence of actions can cause the agent to start in disadvantageous or advantageous states. For instance³, in Half Cheetah, the random actions may have put the agent head-down by the time the latter is first allowed to control the environment. It must thus first get back on its feet before starting to move. On the contrary, in a simpler environment like Swimmer, the initial random action queue might give some initial speed to the agent, yielding higher rewards at the beginning than for its undelayed counterpart. We provide another experiment on Pendulum where we study the robustness of DIDA, compared to the baselines, against an increase in the delay, for fixed hyper-parameters. We report the final mean return per episode for different values of the delay in Figure 3. Clearly, among all the algorithms studied, DIDA is the most robust to an increase in the delay.

³We provide a visual interpretation of these phenomena here.
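As a concrete illustration of the delayed-environment implementation described above (an undelayed environment wrapped so that the agent receives an augmented state, with the initial action queue sampled at random), here is a minimal sketch. It is an assumed reconstruction using the gymnasium API, not the code used for the experiments; it supposes continuous (Box) spaces and a delay of at least one step, and returning zero reward for the first `delay` steps is our own simplification of delayed reward collection.

```python
# Illustrative constant-delay wrapper (not the authors' implementation).
from collections import deque
import gymnasium as gym
import numpy as np

class ConstantDelayWrapper(gym.Wrapper):
    """Constant observation-delay wrapper: the agent observes the state (and reward)
    from `delay` steps in the past, together with the queue of actions taken since.
    Assumes continuous (Box) spaces and delay >= 1."""

    def __init__(self, env, delay):
        super().__init__(env)
        self.delay = delay
        self.obs_buffer = deque(maxlen=delay + 1)  # s_{t-delay}, ..., s_t
        self.act_queue = deque(maxlen=delay)       # a_{t-delay}, ..., a_{t-1}
        self.rew_queue = deque()                   # rewards, observed `delay` steps late

    def _augmented(self):
        # Augmented state x = (oldest buffered observation, action queue).
        return np.concatenate([self.obs_buffer[0], *self.act_queue])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.obs_buffer.clear()
        self.act_queue.clear()
        self.rew_queue = deque([0.0] * self.delay)  # assumption: pad unseen rewards with 0
        self.obs_buffer.append(obs)
        # The first augmented state is obtained by applying `delay` random actions,
        # which may leave the agent in an advantageous or disadvantageous hidden state.
        for _ in range(self.delay):
            a = self.env.action_space.sample()
            obs, _, _, _, info = self.env.step(a)
            self.obs_buffer.append(obs)
            self.act_queue.append(a)
        return self._augmented(), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.obs_buffer.append(obs)
        self.act_queue.append(action)
        self.rew_queue.append(reward)
        # The agent observes the state and reward from `delay` steps in the past.
        return self._augmented(), self.rew_queue.popleft(), terminated, truncated, info
```

Wrapping, for example, Pendulum with `ConstantDelayWrapper(gym.make("Pendulum-v1"), delay=5)` reproduces the kind of 5-step-delay setting considered in Figures 1 and 2.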
Figure 4 ((a) Expert 2016-2017, (b) DIDA 2016-2017, (c) DIDA 2019): comparison of the actions selected by the expert and by the 10th iteration of DIDA on the Trading environment, during training (2016-2017) and test (2019). Results show the action (colour) selected for each day (row) at each time of the day (column).

Figure 5: evolution of the return of DIDA for the trading of the EUR-USD pair in 2019. Performance is computed as a percentage of the invested amount.

Since Trading is a batch-RL task (the training dataset of 2016-2017 is a fixed set of historical exchange rates), it is necessary to select the hyper-parameters of DIDA by validation on 2018 to avoid overfitting. The second iteration of DIDA has been selected by validation. The results on the test set of 2019 are shown in Figure 5. They show the ability of DIDA to adapt to non-integer delays and maintain a positive return, which is not simple when taking the spread into account in trading the EUR-USD pair. It clearly outperforms the delayed expert on this task. Surprisingly, the delayed policy is able to outperform the expert on the first period of the test. This could be explained by the fact that the undelayed expert policy may have overfitted the training set, while the imitation learning of an undelayed policy acted as a regularization. We provide an analysis of the policy learned by DIDA with respect to the expert in Figure 4. It illustrates the specific overfitting problem of the Trading task caused by the batch-RL scenario, by showing the policy of DIDA at the 10th and last iteration compared to the expert policy. Interestingly, the allocations of all the presented policies follow peculiar daily patterns. Figures 4a and 4b, where the allocations on the training set are presented, show that DIDA's policy replicates the expert policy's temporal patterns quite faithfully. On the other hand, in Figure 4c, where the test set is shown, it can be noticed that DIDA's allocations start to shift away from the expert's pattern. Finally, we provide in Appendix E.3 additional experiments on different stochastic versions of Pendulum.

8. Conclusion

In this paper, we explored the possibility of splitting delayed reinforcement learning into easier tasks: traditional undelayed reinforcement learning on the one hand, and imitation learning on the other. We provided a theoretical analysis demonstrating bounds on the performance of a delayed policy compared to undelayed experts, for both integer and non-integer constant delays. These bounds apply in our particular setting but are also of general interest for delayed policies. This guided us in the creation of our algorithm, DIDA, which learns a delayed policy by imitating an undelayed expert using DAGGER. We have empirically shown that this idea, although rather simple, provides excellent results in practice, achieving high performance with remarkable sample efficiency and light computation. We believe that our work paves the way for many possible generalizations, which include stochastic delays and particular situations in which an undelayed simulator is not available but an undelayed dataset can be artificially created from delayed trajectories in order to train an expert offline.

References

Agarwal, M. and Aggarwal, V. Blind decision making: Reinforcement learning with delayed observations. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 31, pp. 2-6, 2021.

Altman, E.
and Nain, P. Closed-loop control with delayed information. ACM sigmetrics performance evaluation review, 20(1):193 204, 1992. Bertsekas, D. P. Dynamic Programming: Determinist. and Stochast. Models. Prentice-Hall, 1987. Bisi, L., Liotet, P., Sabbioni, L., Reho, G., Montali, N., Restelli, M., and Corno, C. Foreign exchange trading: a risk-averse batch reinforcement learning approach. In Proceedings of the First ACM International Conference on AI in Finance, pp. 1 8, 2020. Bouteiller, Y., Ramstedt, S., Beltrame, G., Pal, C., and Binas, J. Reinforcement learning with random delays. In International Conference on Learning Representations, 2020. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016. Chen, B., Xu, M., Li, L., and Zhao, D. Delay-aware modelbased reinforcement learning for continuous control. Neurocomputing, 450:119 128, 2021. Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785 794, 2016. Derman, E., Dalal, G., and Mannor, S. Acting in delayed environments with non-stationary markov policies. ar Xiv preprint ar Xiv:2101.11992, 2021. Dro zdziel, P., Tarkowski, S., Rybicka, I., and Wrona, R. Drivers reaction time research in the conditions in the real traffic. Open Engineering, 10(1):35 47, 2020. Dugard, L. and Verriest, E. I. Stability and control of timedelay systems, volume 228. Springer, 1998. Dulac-Arnold, G., Mankowitz, D., and Hester, T. Challenges of real-world reinforcement learning. ar Xiv preprint ar Xiv:1904.12901, 2019. Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503 556, 2005. Firoiu, V., Ju, T., and Tenenbaum, J. At human speed: Deep reinforcement learning with action delay. ar Xiv preprint ar Xiv:1810.07286, 2018. Geurts, P., Ernst, D., and Wehenkel, L. Extremely randomized trees. Machine learning, 63(1):3 42, 2006. Gu, K. and Niculescu, S.-I. Survey on recent results in the stability and control of time-delay systems. J. Dyn. Sys., Meas., Control, 125(2):158 165, 2003. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861 1870. PMLR, 2018. Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In In Proc. 19th International Conference on Machine Learning. Citeseer, 2002. Katsikopoulos, K. V. and Engelbrecht, S. E. Markov decision processes with delays and asynchronous cost collection. IEEE transactions on automatic control, 48(4): 568 574, 2003. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Laskey, M., Lee, J., Fox, R., Dragan, A., and Goldberg, K. Dart: Noise injection for robust imitation learning. In Conference on robot learning, pp. 143 156. PMLR, 2017. Liotet, P., Venneri, E., and Restelli, M. Learning a belief representation for delayed reinforcement learning. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1 8. IEEE, 2021. Mahmood, A. R., Korenkevych, D., Komer, B. J., and Bergstra, J. Setting up a reinforcement learning task with a real-world robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4635 4640. IEEE, 2018. Metelli, A. 
M., Mazzolini, F., Bisi, L., Sabbioni, L., and Restelli, M. Control frequency adaptation via action persistence in batch reinforcement learning. In International Conference on Machine Learning, pp. 6862 6873. PMLR, 2020. Mysore, S., Mabsout, B., Mancuso, R., and Saenko, K. Regularizing action policies for smooth control with reinforcement learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 1810 1816. IEEE, 2021. Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Icml, 2010. Delayed Reinforcement Learning by Imitation Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., and Peters, J. An algorithmic perspective on imitation learning. ar Xiv preprint ar Xiv:1811.06711, 2018. Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. ar Xiv preprint ar Xiv:1705.07057, 2017. Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Rachelson, E. and Lagoudakis, M. G. On the locality of action domination in sequential decision making. 2010. Riva, A., Bisi, L., Liotet, P., Sabbioni, L., Vittori, E., Pinciroli, M., Trapletti, M., and Restelli, M. Learning fx trading strategies with fqi and persistent actions. 2021. Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661 668. JMLR Workshop and Conference Proceedings, 2010. Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627 635. JMLR Workshop and Conference Proceedings, 2011. Schuitema, E., Bus oniu, L., Babuˇska, R., and Jonker, P. Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3226 3231. IEEE, 2010. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889 1897. PMLR, 2015a. Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. ar Xiv preprint ar Xiv:1506.02438, 2015b. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181 211, 1999. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. Villani, C. Optimal transport: old and new, volume 338. Springer, 2009. Walsh, T. J. et al. Learning and planning in environments with delayed feedback. Autonomous Agents and Multi Agent Systems, 18(1):83, 2009. Wilcox, J. W. The effect of transaction costs and delay on performance drag. Financial Analysts Journal, 49(2): 45 54, 1993. Xu, T., Li, Z., and Yu, Y. Error bounds of imitating policies and environments. Advances in Neural Information Processing Systems, 33, 2020. Delayed Reinforcement Learning by Imitation A. General Results A.1. Bounds involving the Wasserstein distance Proposition A.1. 
Let X,Y be two random variables on R with distribution π0, π1 respectively. Then, |Er Xs Er Y s| ď W1pπ0}π1q. Proof. One has Er Xs Er Y s ż R xpπ0pxq π1pxqqdx ď W1pπ0}π1q, since x ÞÑ x is 1-LC. The same holds for Er Y s Er Xs, since the Wasserstein distance is symmetric. The next result asserts that if one applies a L-LC function to two random variables, one gets two random variables with distribution whose Wasserstein distance is bounded by the original Wasserstein distance multiplied by a factor L. Proposition A.2. Let g : ΩÑ R be an L-LC function and π0, π1 two probability measures over the metric space Ω. Note gπ the distribution of the random variable gp Xq where X is distributed according to π. Then, W1pgπ0}gπ1q ď LW1pπ0}π1q. Proof. By definition of Wasserstein distance, W1pgπ0}gπ1q sup }f}Lď1 R fpxqpgπ0pxq gπ1pxqqdx ˇˇˇˇ R fpxqgπ0pxqdx ż R fpxqgπ1pxqdx ˇˇˇˇ . We can then use the definitions of gπ0 and gπ1 to rewrite the previous formula in terms of expected values W1pgπ0}gπ1q sup }f}Lď1 R fpgpxqqπ0pxqdx ż R fpgpxqqπ1pxqdx ˇˇˇˇ R fpgpxqqpπ0pxq π1pxqqdx ˇˇˇˇ . Since g is Lg-LC by assumption, the composition fpgpxqq is still Lg-LC, so W1pgπ0}gπ1q ď Lg W1pπ0}π1q. Proposition A.3. Consider an MDP with π a policy such that its state-action value function is Lipschitz with constant LQ in the second argument, i.e. it satisfies, for all s P S and a, a1 P A ˇˇQπps, aq Qπps, a1q ˇˇ ď LQ d Apa, a1q, then, for every couple of probability distributions η, ν over A, one has that ˇˇˇˇˇˇ E X η Y ν r Qπps, Xq Qπps, Y qs ˇˇˇˇˇˇ ď LQW1pηp q}νp qq. Proof. We note gη and gν the respective distributions of Qπps, Xq and Qπps, Y q. First of all, we can apply Proposition A.1 to say that ˇˇˇˇˇˇ E X η Y ν r Qπps, Xq Qπps, Y qs ˇˇˇˇˇˇ ď W1pgη}gνq. Delayed Reinforcement Learning by Imitation For a fixed s P S, the two random variables Qπps, Xq and Qπps, Y q can be seen as the application of the LQ-Lipschitz function Qπps, q : A Ñ R to X and Y , respectively. This satisfies the assumptions of Proposition A.2, therefore W1p Qπps, ηp qq}Qπps, νp qqq ď LQW1pηp q}νp qq. Proposition A.4. Consider an LP transition function p and an Lπ policy π in some MDP M. Then, for any f : S Ñ R which is 1-LC, we have that the function g : S Ñ R given by A pps1|s, aqπpa|sq da ds1 is Lipschitz with constant Lpp1 Lπq Proof. Let s, z P S, one has |gpsq gpzq| ˇˇˇˇ A pps1|s, aqπpa|sq pps1|z, aqπpa|zq da ds1 ˇˇˇˇ A pps1|s, aq pπpa|sq πpa|zqq da ds1 ˇˇˇˇ pps1|s, aq pps1|z, aq πpa|zq da ds1 ˇˇˇˇ (5) A pπpa|sq πpa|zqq ż S fps1qpps1|s, aq ds1 da ˇˇˇˇ looooooooooooooooooooooooooooomooooooooooooooooooooooooooooon S fps1q pps1|s, aq pps1|z, aq ds1 da ˇˇˇˇ loooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooon where we add and remove the quantity pps1|s, aqπpa|zq in Equation (5) and use Fubini s theorem in Equation (6). By Lipschitzness of p, we have that a ÞÑ ş S fps1qpps1|s, aq ds1 is Lp-LC. Thus, A pπpa|sq πpa|zqq S fps1qpps1|s, aq ds1 ď LπLP d Sps, zq For the second term, again, by Lipschitzness of p, we have A LP d Sps, zqπpa|zq da ˇˇˇˇ LP d Sps, zq. |gpsq gpzq| ď Lpp Lπ 1q d Sps, zq. A.2. Bounding σρ b We provide two bounds for σρ b E x1 d rπ ρ s,s1 bp |x1q rd Sps, s1qs with ρ a distribution on S. The first uses the assumption that the state space in is Rn and is equipped with the Euclidean norm while the second assumes that the MDP is TLC. Delayed Reinforcement Learning by Imitation Lemma A.1 (Euclidean bound). Consider an MDP such that S Ă Rn is equipped with the Euclidean norm. 
Then one has σρ b ď E x1 d π ρ p q Var s bp |x1qps|x1q Proof. We derive the following results which intermediate steps are detailed after. σρ b E x1 d rπ ρ s,s1 bp |x1q E x1 d rπ ρ s,s1 bp |x1q ps1 sq2 ı (7) E x1 d rπ ρ E s,s1 bp |x1q rps1 sq2s. (8) Equation (7) follows from the definition of the Euclidean norm and Equation (8) is obtained by applying Jensen s inequality. To conclude, since s1 and s are i.i.d., one has that Es,s1 bp |x1q ps1 sq2 2Vars bp |x1qrss. The following proposition is involved in the proof of the bound of σρ b when the MDP is TLC. Proposition A.5. Consider an LT -TLC MDP. Consider any augmented state x ps1, a1, . . . , a q P S ˆ A for a given P Rě0. Then W1 pbp |xq}δs1q ď LT Proof. First, consider P N We proceed by induction. The case case 0 is true since the current state is known exactly without delay. The case 1 is true by the assumption of LT -TLC. Assume that the statement is true for P N, then W1 pbp |xq}δs1q sup }f}Lď1 S fps1q bps1|xq δs1ps1q ds1 ˇˇˇˇ S pps2|s1, a1q ż S fps1q bps1|s2, a2, , a q δs1ps1q ds1 ˇˇˇˇ (9) S pps2|s1, a1q ż S fps1q bps1|s2, a2, , a q δs2ps1q ds1 S pps2|s1, a1q ż S fps1q δs2psq δs1ps1q ds1 ˇˇˇˇ (10) ď sup }f}Lď1 S pps2|s1, a1q ż S fps1q bps1|s2, a2, , a q δs2ps1q ds1 ˇˇˇˇ looooooooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooooooon A W1 p Pp|s1, a1q}δs1q looooooooooomooooooooooon where (9) hols by conditioning on the second visited state s2 and Equation (10) holds by adding and subtracting δs2. The reader may have recognized that the statement at 1 can be used to bound A while B can be bounded with the LT -TLC assumption. Therefore W1 pbp |xq}δs1q ď LT , and the statement holds for any P N. The proof for a general delay P Rě0 start by consider the fractional part of the delay in a similar way as described above. The bound will be the sum of a term like A where the statement for integer delays can be applied and a term t u LT where t u is fractional part of . Delayed Reinforcement Learning by Imitation Lemma A.2 (Time-Lipschitz bound). Consider an LT -Lipschitz MDP with delay . Then, one has σρ b ď 2 LT Proof. Call sx the state contained in x. By triangular inequality, one has σρ b E x d rπ ρ s,s1 bp |xq ď E x d rπ ρ s bp |x1q rd Sps, sxqs E x d rπ ρ s1 bp |xq d Spsx, s1q 2 E x d rπ ρ S d Sps, sxqbps|xq ds ȷ 2 E x d rπ ρ S d Sps, sxq pbps|xq δsxpsqq ds ȷ (11) ď 2 E x d rπ ρ r W1 pbp |xq}δsxqs , (12) where Equation (11) holds because ş S d Sps, sxqδsxpsqds 0 and Equation (12) follows by recognizing the Wasserstein distance. One can then use Proposition A.5 on each of the two terms to conclude. B. Bounding the Value Function via Performance Difference Lemma Lemma 6.1. [Delayed Performance Difference Lemma] Consider an undelayed policy πE and a -delayed policy rπ, with P Rě0. Then, for any x P X, E s bp |xqr V πEpsqs V rπpxq 1 1 γ E x1 d rπ x E s bp |x1qr V πEpsqs E s bp |x1q a rπp |x1q r QπEps, aqs Proof. We first prove the result for integer delay P N. We start by adding and subtracting Es bp |x1q a rπp |x1q rps, aq γ Es1 pp |s,aqr V πEps1qs to the quantity of interest Ipxq Es bp |xqr V πEpsqs V rπpxq. This yields: Ipxq E s bp |xqr V πEpsqs E s bp |x1q a rπp |x1q rps, aq γ E s1 pp |s,aqr V πEps1qs ȷ looooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooon E s bp |x1q a rπp |x1q rps, aq γ E s1 pp |s,aqr V πEps1qs ȷ V rπpxq loooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooon The first term is A E s bp |xqr V πEpsqs E s bp |x1q a rπp |x1q r QπEps, aqs . 
Delayed Reinforcement Learning by Imitation For the second term, note that V rπpxq Es bp |x1q a rπp |x1q rrps, aqs γ Ex1 rpp |x,aq a rπp |x1q r V rπpx1qs. Therefore, B γ E s bp |x1q a rπp |x1q E s1 pp |s,aqr V πEps1qs ȷ γ E x1 rpp |x,aq a rπp |x1q r V rπpx1qs. By observing that ş S bps|xqpps1|s, aqds ş X ppx1|x, aqbps1|x1qdx1, that is, the probability of the next state conditioned on the current augmented state x and the action a can be computed by either conditioning on the probability of the current unobserved state s or conditioning on the next augmented state x1. B γ E x1 rpp |x,aq a rπp |x1q E s bp |x1qr V πEps1qs V rπpx1q ȷ γ E x1 rpp |x,aq a rπp |x1q where we have recognised the quantity of interest I taken at another augmented state. One can thus iterate as in the original performance difference lemma to get E s bp |xtqr V πEpsqs E s bp |xtq a rπp |xtq r QπEps, aqs ˇˇˇˇˇˇˇ x0 x 1 1 γ E x1 d rπ x E s bp |x1qr V πEpsqs E s bp |x1q a rπp |x1q r QπEps, aqs ffifl , (13) where Equation (13) is obtained by recognising the discounted state distribution under policy rπ. This completes the proof. We now assume non-integer delay, setting P p0, 1q but the proof for P R follows easily. The proof is the same as above except for substituting b with b . One step which might not be evident is that ż S b pst |xtqppst 1 |st , aq dst ż X ppxt 1|xt, aqb pst 1 |xt 1q dxt 1. This is true because, for xt pst, at 1q, xt 1 pst 1, atq P X, ż S b pst |xtqppst 1 |st , aq dst S b pst |xtq ż S b pst 1 |st 1, aqb1 pst 1|st , aq dst 1 dst (14) S b pst 1 |st 1, aq ż S b pst |xtqb1 pst 1|st , aq dst dst 1 S b pst 1 |st 1, aq ż S b pst |st, at 1qb1 pst 1|st , aq dst dst 1 S b pst 1 |st 1, atqδapatq ż S b pst |st, at 1qb1 pst 1|st , aq dst dst 1 dat (15) X b pst 1 |xt 1q ppxt 1|xt, aq dxt 1 where Equation (14) holds by replacing the transition p as in Equation (4) and Equation (15) holds by definition of the transition in the augmented MDP. Theorem 6.1. Consider an p LP , Lrq-LC MDP and a Lπ-LC undelayed policy πE, such that QπE is LQ -L.C.4. Let rπb be 4In fact, only Lipschizness in the second argument is necessary (see proof). Delayed Reinforcement Learning by Imitation the -delayed policy defined as in Equation (3), with P Rě0. Then, for any x P X, E s bp |xqr V πEpsqs V rπbpxq ď LQLπ where σx b E x1 d rπb x s,s1 bp |x1q rd Sps, s1qs. Proof. We first prove the result for integer delay P N. The first step in this proof is to use the results of Lemma 6.1, which yields, for any x P X: E s bp |xqr V πEpsqs V rπbpxq ď 1 1 γ E x1 d rπb x E s bp |x1qr V πEpsqs E s bp |x1q a rπbp |x1q r QπEps, aqs looooooooooooooooooooooooomooooooooooooooooooooooooon ffiffiffiffiffiffifl We consider the term inside the expectation over x1, called A. We reformulate this term to highlight how we then apply Proposition A.3. A E s bp |x1q E a1 πEp |sq a2 rπbp |x1q r QπEps, a1q QπEps, a2qs ď LQ E s bp |x1q W1pπEp |sq}rπbp |x1qq . (16) To finish the proof, it remains to upper bound Es bp |x1q r W1pπEp |sq}rπbp |x1qqs with σx b E x1 d rπb x s,s1 bp |x1q rd Sps, s1qs. W1pπEp |sq}rπbp |x1qq sup }f}Lď1 A fpaqpπEpa|sq rπbpa|x1qqda ˇˇˇˇ A fpaqpπEpa|sq ż S πEpa|s1qbps1|x1qds1qda ˇˇˇˇ (17) S bps1|x1q sup }f}Lď1 A fpaqpπEpa|sq πEpa|s1qqda ˇˇˇˇ ds1 (18) S bps1|x1q W1pπEp |sq}πEp |s1qqds1 S bps1|x1q d Sps, s1qds1, (19) where Equation (17) holds by definition of the optimal imitated delayed policy (see Equation (3)), Equation (18) holds by application of Fubini-Tonelli s theorem and Equation (19) holds by Lipschitzness of the expert undelayed policy. 
By re-injecting this result into Equation (16), and taking the expectation over $s \sim b(\cdot|x')$ and $x' \sim d^{\tilde\pi_b}_{x}$, we get the desired result. We now assume a non-integer delay, setting $\Delta \in (0,1)$; the proof for general $\Delta \in \mathbb{R}_{\ge 0}$ follows easily. In this case, the optimal policy learnt by DIDA is
$$\tilde\pi_b(a|x) = \int_{\mathcal{S}} b_\Delta(s|x)\, \pi_E(a|s)\, ds. \tag{20}$$
The proof remains the same except for the first step, where Lemma 6.1 is of course applied in its non-integer delay version.

Corollary 6.2. Under the assumptions of Theorem 6.1, and assuming additionally that $\mathcal{S} \subset \mathbb{R}^n$ is equipped with the Euclidean norm, for any $x \in \mathcal{X}$,
$$\mathbb{E}_{s \sim b(\cdot|x)}\!\left[V^{\pi_E}(s)\right] - V^{\tilde\pi_b}(x) \le \frac{\sqrt{2}\, L_Q L_\pi}{1-\gamma}\, \mathbb{E}_{x' \sim d^{\tilde\pi_b}_{x}}\!\left[\sqrt{\operatorname{Var}_{s \sim b(\cdot|x')}(s|x')}\right].$$
Proof. The result follows from the application of Lemma A.1 to Theorem 6.1.

Corollary 6.1. Under the assumptions of Theorem 6.1, and assuming additionally that the MDP is $L_T$-TLC, for any $x \in \mathcal{X}$,
$$\mathbb{E}_{s \sim b(\cdot|x)}\!\left[V^{\pi_E}(s)\right] - V^{\tilde\pi_b}(x) \le \frac{2 \Delta L_T L_Q L_\pi}{1-\gamma}.$$
Proof. The result follows from the application of Lemma A.2 to Theorem 6.1.

B.1. Lower Bounding the Value Function

Theorem 6.2. For every $L_\pi > 0$ and $L_Q > 0$, there exists an MDP such that the optimal policy is $L_\pi$-LC and its state-action value function is $L_Q$-LC in the second argument, but for any $\Delta$-delayed policy $\tilde\pi$, with $\Delta \in \mathbb{R}_{\ge 0}$, and any $x \in \mathcal{X}$,
$$\mathbb{E}_{s \sim b(\cdot|x)}\!\left[V^{*}(s)\right] - V^{\tilde\pi}(x) \ge \sqrt{\tfrac{2}{\pi}}\, \frac{L_Q L_\pi}{1-\gamma}\, \mathbb{E}_{x' \sim d^{\tilde\pi}_{x}}\!\left[\sqrt{\operatorname{Var}_{s \sim b(\cdot|x')}(s|x')}\right],$$
where $V^{*}$ is the value function of the optimal undelayed policy.

Proof. We consider an MDP $M = (\mathcal{S}, \mathcal{A}, p, R, \mu, \gamma)$ with $\mathcal{S} = \mathbb{R}$ and $\mathcal{A} = \mathbb{R}$, whose state transition is given by $s_{t+1} = s_t + \frac{a_t}{L_\pi} + \varepsilon_t$, where the $\varepsilon_t$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. The transition distribution can be written as
$$p(s'|s,a) = \mathcal{N}\!\Big(s';\, s + \tfrac{a}{L_\pi},\, \sigma^2\Big).$$
Defining $L_r := L_Q L_\pi$, the reward is given by $r(s,a) = -L_r\left|s + \tfrac{a}{L_\pi}\right|$. Note that the reward is never positive, yet the policy $\pi^{*}(\cdot|s) = \delta_{-L_\pi s}(\cdot)$ always yields $0$ reward and is therefore optimal. Clearly, its value function satisfies $V^{*}(s) = 0$ for every $s \in \mathcal{S}$. Its $Q$-function is
$$Q^{*}(s,a) = -L_r\left|s + \tfrac{a}{L_\pi}\right| + \gamma \int_{\mathbb{R}} V^{*}(s')\, p(s'|s,a)\, ds' = -L_Q\left|L_\pi s + a\right|.$$
Therefore, $Q^{*}$ is indeed $L_Q$-LC in the second argument. Consider now any $\Delta$-delayed policy $\tilde\pi$. At each time step $t$, the current state $s_t$ can be decomposed as
$$s_t = \underbrace{s_{t-\Delta} + \sum_{\tau=t-\Delta}^{t-1} \frac{a_\tau}{L_\pi}}_{=:\,\phi(x_t)} + \underbrace{\sum_{\tau=t-\Delta}^{t-1} \varepsilon_\tau}_{\sim\,\mathcal{N}(0,\, \Delta\sigma^2)},$$
where the first quantity is a deterministic function of the augmented state, while the second is distributed as $\mathcal{N}(0, \Delta\sigma^2)$. The expected value of the instantaneous reward is then given by
$$\mathbb{E}\!\left[r(x_t, a)\right] = -L_r\, \mathbb{E}\!\left[\left|s_t + \tfrac{a}{L_\pi}\right|\right] = -L_r\, \mathbb{E}\!\left[\left|\mathcal{N}\!\Big(\phi(x_t) + \tfrac{a}{L_\pi},\, \Delta\sigma^2\Big)\right|\right].$$
The function $f : y \mapsto \mathbb{E}\big[|\mathcal{N}(y, \sigma'^2)|\big]$ has its minimum at $y = 0$ by symmetry of the normal distribution; its value there is the mean of a half-normal distribution, that is, $\mathbb{E}\big[|\mathcal{N}(0, \sigma'^2)|\big] = \sqrt{\tfrac{2}{\pi}}\, \sigma'$. Therefore
$$\mathbb{E}\!\left[r(x_t, a)\right] \le -L_r \sqrt{\tfrac{2}{\pi}}\, \sqrt{\Delta}\, \sigma,$$
which implies $V^{\tilde\pi}(x_t) \le -\frac{L_r}{1-\gamma} \sqrt{\tfrac{2}{\pi}} \sqrt{\operatorname{Var}_{s \sim b(\cdot|x_t)}(s|x_t)}$, by noticing that $L_r = L_Q L_\pi$ and that $\operatorname{Var}_{s \sim b(\cdot|x)}(s|x) = \Delta\sigma^2$. Note that $\sqrt{\operatorname{Var}_{s \sim b(\cdot|x)}(s|x)}$ is the same for every $x \in \mathcal{X}$, so we can replace it with $\mathbb{E}_{x' \sim d^{\tilde\pi}_{x}}\big[\sqrt{\operatorname{Var}_{s \sim b(\cdot|x')}(s|x')}\big]$ to obtain a result more similar to Theorem 6.1. Recalling that the optimal value function is $0$ at any state concludes the proof.
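For intuition, the construction above is easy to simulate. The sketch below (plain NumPy; the parameter values, variable names, and rollout length are our own choices, not taken from the paper) rolls out a delayed agent that plays the undelayed optimal action at the deterministic part $\phi(x_t)$ of the current state and compares its average reward with the per-step bound $-L_r \sqrt{2\Delta/\pi}\, \sigma$, whereas the undelayed optimal policy collects exactly $0$ at every step.

```python
import numpy as np

# Toy simulation of the lower-bound MDP of Theorem 6.2 (assumed parameters).
L_pi, L_Q, sigma, delay, T = 2.0, 1.0, 0.5, 3, 200_000
L_r = L_Q * L_pi
rng = np.random.default_rng(0)

states = [0.0] * (delay + 1)   # states[0] = s_{t-delay} (observed), states[-1] = s_t (unobserved)
actions = [0.0] * delay        # the last `delay` actions, oldest first
rewards = []
for _ in range(T):
    s_t = states[-1]
    phi = states[0] + sum(actions) / L_pi    # deterministic part of s_t given the augmented state
    a = -L_pi * phi                          # best action available to the delayed agent
    rewards.append(-L_r * abs(s_t + a / L_pi))
    s_next = s_t + a / L_pi + rng.normal(0.0, sigma)   # s_{t+1} = s_t + a/L_pi + eps
    states = states[1:] + [s_next]
    actions = actions[1:] + [a]

print("average delayed reward :", np.mean(rewards))
print("per-step lower bound   :", -L_r * np.sqrt(2 * delay / np.pi) * sigma)
# The undelayed optimal policy a = -L_pi * s obtains reward exactly 0 at every step.
```

The empirical average matches the theoretical per-step value because the simulated agent already plays the best possible action given the augmented state; the residual uncertainty $\mathcal{N}(0, \Delta\sigma^2)$ is what no delayed policy can remove.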
C. Bounding the Value Function via the State Distribution

For this bound, we wish to use the difference in state distribution between the delayed policy and the undelayed expert to capture their difference in performance. Obviously, they do not share the same state space, since the DMDP is handled by augmenting the state. However, it is possible to define a unifying framework with the following object. In this section, we consider $\Delta \in \mathbb{N}$.

Definition C.1 ($\Delta$th-order MDP). Given an MDP $M = (\mathcal{S}, \mathcal{A}, p, R, \mu)$, we define its corresponding $\Delta$th-order MDP, $\Delta \in \mathbb{N}$, as the MDP $\bar M$ (a bar will be used to refer to elements of a $\Delta$th-order MDP) with:
- State space $\bar{\mathcal{S}} = \mathcal{S}^{\Delta+1} \times \mathcal{A}^{\Delta}$, whose states $\bar s$ are composed of the last $\Delta+1$ states and $\Delta$ actions of the MDP, namely $\bar s = (s_1, a_1, s_2, a_2, \dots, s_\Delta, a_\Delta, s_{\Delta+1})$.
- Unchanged action space $\mathcal{A}$.
- Reward function $\bar R$, overwriting the undelayed notation but taking a $\Delta$th-order state as input, such that $\bar R(\bar s_t, a) = R(s_t, a)$, where $s_t$ is the most recent state contained in $\bar s_t$; the overwriting is justified by this equality.
- Transition function $\bar p$ given by
$$\bar p(\bar s'|\bar s, a) = p(s'_{\Delta+1}|s_{\Delta+1}, a)\, \delta_a(a'_\Delta) \prod_{i=1}^{\Delta} \delta_{s_{i+1}}(s'_i) \prod_{i=1}^{\Delta-1} \delta_{a_{i+1}}(a'_i), \tag{21}$$
where $\bar s = (s_1, a_1, \dots, a_\Delta, s_{\Delta+1})$ and $\bar s' = (s'_1, a'_1, \dots, a'_\Delta, s'_{\Delta+1})$.
- Initial state distribution $\bar\mu$ such that the initial action queue is distributed as in the delayed MDP, while the states of the state queue are distributed as $s_{i+1} \sim p(\cdot|s_i, a_i)$.

This definition is inspired by the concept of a $\Delta$th-order Markov chain. In this definition, $s_{\Delta+1}$ is intended to be the current state, $s_1$ is the $\Delta$-delayed state, and $a_{1:\Delta}$ is the action queue. Therefore, from the state of the $\Delta$th-order MDP one can either extract an augmented state and query a delayed policy, or extract the current state and query an undelayed policy. This implies that one can define the state distribution on the $\Delta$th-order MDP for both an undelayed and a delayed policy. We overwrite the notations and write $d^{\pi}$ and $d^{\tilde\pi}$ for the corresponding distributions of states on the $\Delta$th-order MDP. Whether a distribution concerns a $\Delta$th-order state, or a state from the underlying MDP or DMDP, will be clear from the notation of the variable sampled from it: for instance, in $s \sim d^{\pi}$, $d^{\pi}$ is a distribution obtained by applying $\pi$ on the undelayed MDP, while $\bar s \sim d^{\pi}$ assumes a distribution under $\pi$ on the $\Delta$th-order MDP.

Before deriving bounds on the $\Delta$th-order state probability distribution, we first prove a Lipschitzness result concerning the $\Delta$th-order MDP which will be used in later proofs.

Lemma C.1. Consider an $(L_P, L_r)$-LC MDP and its $\Delta$th-order MDP counterpart. Let $f : \bar{\mathcal{S}} \to \mathbb{R}$ be such that $\|f\|_L \le 1$ w.r.t. the $L^2$-norm on $\bar{\mathcal{S}}$. Then the function
$$g_f : \bar{\mathcal{S}} \times \mathcal{A} \to \mathbb{R}, \qquad (\bar s, a) \mapsto \int_{\bar{\mathcal{S}}} f(\bar s')\, \bar p(\bar s'|\bar s, a)\, d\bar s',$$
is $(1 + L_P)$-LC w.r.t. its second variable.

Proof. Let $\bar s \in \bar{\mathcal{S}}$ with $\bar s = (s_1, a_1, \dots, a_\Delta, s_{\Delta+1})$ and $a, b \in \mathcal{A}$. Then,
$$\left|g_f(\bar s, a) - g_f(\bar s, b)\right| = \left| \int_{\bar{\mathcal{S}}} f(\bar s') \big(\bar p(\bar s'|\bar s, a) - \bar p(\bar s'|\bar s, b)\big)\, d\bar s' \right| \tag{22}$$
$$= \left| \int_{\mathcal{S} \times \mathcal{A}} f(s_2, a_2, \dots, s_{\Delta+1}, a'_\Delta, s'_{\Delta+1}) \big(p(s'_{\Delta+1}|s_{\Delta+1}, a)\, \delta_a(a'_\Delta) - p(s'_{\Delta+1}|s_{\Delta+1}, b)\, \delta_b(a'_\Delta)\big)\, ds'_{\Delta+1}\, da'_\Delta \right|, \tag{23}$$
where in Equation (23) we have integrated over the elements of $\bar s'$ fixed by the Dirac distributions of Equation (21), except for the last action. Note that $h : (a', s') \mapsto f(s_2, a_2, \dots, s_{\Delta+1}, a', s')$ is $1$-LC because $f$ is $1$-LC. We add and subtract the quantity $p(s'_{\Delta+1}|s_{\Delta+1}, b)\, \delta_a(a'_\Delta)$ inside the integral to get
$$\left|g_f(\bar s, a) - g_f(\bar s, b)\right| \le \left| \int_{\mathcal{S} \times \mathcal{A}} \delta_a(a'_\Delta)\, h(a'_\Delta, s'_{\Delta+1}) \big(p(s'_{\Delta+1}|s_{\Delta+1}, a) - p(s'_{\Delta+1}|s_{\Delta+1}, b)\big)\, ds'_{\Delta+1}\, da'_\Delta \right|$$
$$\qquad + \left| \int_{\mathcal{S} \times \mathcal{A}} h(a'_\Delta, s'_{\Delta+1})\, p(s'_{\Delta+1}|s_{\Delta+1}, b) \big(\delta_a(a'_\Delta) - \delta_b(a'_\Delta)\big)\, ds'_{\Delta+1}\, da'_\Delta \right| \le (L_P + 1)\, d_{\mathcal{A}}(a, b),$$
where the last inequality follows by integrating out the Dirac terms, taking the supremum over $1$-Lipschitz functions and recognizing Wasserstein distances: the first term is bounded by $L_P\, d_{\mathcal{A}}(a,b)$ by the $L_P$-LC assumption on the transition, and the second by $d_{\mathcal{A}}(a,b)$ since $h$ is $1$-LC in its first argument.

We can now prove a first important intermediate result, which bounds the Wasserstein distance between the discounted state distributions of the undelayed and the delayed policy on the $\Delta$th-order MDP.

Theorem C.1. Consider an $(L_P, L_r)$-LC MDP $M$ and its $(L_{\bar P}, L_r)$-LC $\Delta$th-order MDP counterpart $\bar M$. Let $\pi_E$ be an $L_\pi$-LC undelayed policy and assume that $\gamma L_P (1 + L_\pi) \le 1$.
Let $\tilde\pi_b$ be a delayed policy as defined in Equation (3). Then the two discounted state distributions $d^{\pi_E}_{\bar\mu}$ and $d^{\tilde\pi_b}_{\bar\mu}$ defined on $\bar M$ satisfy
$$\mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right) \le \frac{\gamma L_\pi L_Q (1 + L_P)}{L_r}\, \sigma_b^{\tilde\mu},$$
where $\sigma_b^{\tilde\mu} := \mathbb{E}_{x \sim d^{\tilde\pi_b}_{\tilde\mu},\, s,s' \sim b(\cdot|x)}\!\left[d_{\mathcal{S}}(s,s')\right]$ and $d^{\tilde\pi_b}_{\tilde\mu}$ is defined on the DMDP.

Proof. We start by developing the term $\mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right)$, using the supremum over the space of functions $f : \bar{\mathcal{S}} \to \mathbb{R}$ such that $\|f\|_L \le 1$ w.r.t. the $L^2$-norm on $\bar{\mathcal{S}}$:
$$\mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right) = \sup_{\|f\|_L \le 1} \left| \int_{\bar{\mathcal{S}}} f(\bar s) \big(d^{\pi_E}_{\bar\mu}(\bar s) - d^{\tilde\pi_b}_{\bar\mu}(\bar s)\big)\, d\bar s \right|.$$
We then use the fact that, for a policy $\pi \in \Pi$, $d^{\pi}_{\bar\mu}(\bar s) = (1-\gamma)\bar\mu(\bar s) + \gamma \int_{\bar{\mathcal{S}}} \bar p^{\pi}(\bar s|\bar s')\, d^{\pi}_{\bar\mu}(\bar s')\, d\bar s'$, where $\bar p^{\pi}(\bar s|\bar s') = \int_{\mathcal{A}} \bar p(\bar s|\bar s', a)\, \pi(a|\bar s')\, da$, to obtain
$$\mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right) = \gamma \sup_{\|f\|_L \le 1} \left| \int_{\bar{\mathcal{S}}} \int_{\bar{\mathcal{S}}} f(\bar s) \big(\bar p^{\pi_E}(\bar s|\bar s')\, d^{\pi_E}_{\bar\mu}(\bar s') - \bar p^{\tilde\pi_b}(\bar s|\bar s')\, d^{\tilde\pi_b}_{\bar\mu}(\bar s')\big)\, d\bar s'\, d\bar s \right|$$
$$\le \gamma \underbrace{\sup_{\|f\|_L \le 1} \left| \int_{\bar{\mathcal{S}}} \int_{\bar{\mathcal{S}}} f(\bar s) \big(d^{\pi_E}_{\bar\mu}(\bar s') - d^{\tilde\pi_b}_{\bar\mu}(\bar s')\big)\, \bar p^{\pi_E}(\bar s|\bar s')\, d\bar s'\, d\bar s \right|}_{A} + \gamma \underbrace{\sup_{\|f\|_L \le 1} \left| \int_{\bar{\mathcal{S}}} \int_{\bar{\mathcal{S}}} f(\bar s) \big(\bar p^{\pi_E}(\bar s|\bar s') - \bar p^{\tilde\pi_b}(\bar s|\bar s')\big)\, d^{\tilde\pi_b}_{\bar\mu}(\bar s')\, d\bar s'\, d\bar s \right|}_{B}, \tag{24}$$
where Equation (24) follows by adding and subtracting the term $\bar p^{\pi_E}(\bar s|\bar s')\, d^{\tilde\pi_b}_{\bar\mu}(\bar s')$ and using the triangle inequality.

The first term $A$ can be bounded by using Fubini's theorem first and then leveraging Proposition A.4, which implies that $\bar s' \mapsto \int_{\bar{\mathcal{S}}} f(\bar s)\, \bar p^{\pi_E}(\bar s|\bar s')\, d\bar s$ is $L_P(1+L_\pi)$-LC. Therefore
$$A \le L_P(1+L_\pi)\, \mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right).$$
Now looking at the second term $B$, we develop $\bar p^{\pi_E}$ and $\bar p^{\tilde\pi_b}$ to highlight the influence of the policy and leverage Equation (3). We write $\bar s = (s_1, a_1, \dots, a_\Delta, s_{\Delta+1})$ and $\bar s' = (s'_1, a'_1, \dots, a'_\Delta, s'_{\Delta+1})$. Moreover, we overwrite the notation of the belief to use it on a $\Delta$th-order state, $b(z|\bar s') := b(z|s'_1, a'_1, \dots, a'_\Delta)$; that is, the belief is based on the augmented state constructed from the oldest state inside $\bar s'$ and the sequence of actions it contains. Then
$$B = \sup_{\|f\|_L \le 1} \left| \int_{\bar{\mathcal{S}}} \int_{\bar{\mathcal{S}}} f(\bar s) \int_{\mathcal{A}} \bar p(\bar s|\bar s', a) \underbrace{\Big(\pi_E(a|s'_{\Delta+1}) - \int_{\mathcal{S}} b(z|\bar s')\, \pi_E(a|z)\, dz\Big)}_{C}\, da\; d^{\tilde\pi_b}_{\bar\mu}(\bar s')\, d\bar s'\, d\bar s \right|,$$
where the term $C$ accounts for the difference between taking an action with the undelayed policy $\pi_E$ and with the belief-based policy. Fubini's theorem yields
$$B = \sup_{\|f\|_L \le 1} \left| \int_{\bar{\mathcal{S}}} \int_{\mathcal{A}} \Big(\pi_E(a|s'_{\Delta+1}) - \int_{\mathcal{S}} b(z|\bar s')\, \pi_E(a|z)\, dz\Big) \underbrace{\int_{\bar{\mathcal{S}}} f(\bar s)\, \bar p(\bar s|\bar s', a)\, d\bar s}_{g_f(\bar s',\, a)}\, da\; d^{\tilde\pi_b}_{\bar\mu}(\bar s')\, d\bar s' \right|,$$
where $g_f(\bar s', a) := \int_{\bar{\mathcal{S}}} f(\bar s)\, \bar p(\bar s|\bar s', a)\, d\bar s$ is $(1+L_P)$-LC in $a$ by Lemma C.1. Noting also that $\pi_E(a|s'_{\Delta+1}) = \int_{\mathcal{S}} \pi_E(a|s'_{\Delta+1})\, b(z|\bar s')\, dz$, one gets
$$B = \sup_{\|f\|_L \le 1} \left| \int_{\bar{\mathcal{S}}} \int_{\mathcal{S}} \int_{\mathcal{A}} \big(\pi_E(a|s'_{\Delta+1}) - \pi_E(a|z)\big)\, g_f(\bar s', a)\, da\; b(z|\bar s')\, dz\; d^{\tilde\pi_b}_{\bar\mu}(\bar s')\, d\bar s' \right|.$$
Then, by Lipschitzness of $\pi_E$ and since $g_f$ is $(1+L_P)$-LC in its second argument,
$$B \le L_\pi (1+L_P) \int_{\bar{\mathcal{S}}} \int_{\mathcal{S}} d_{\mathcal{S}}(s'_{\Delta+1}, z)\, b(z|\bar s')\, dz\; d^{\tilde\pi_b}_{\bar\mu}(\bar s')\, d\bar s'.$$
Now, one can observe that $s \mapsto \int_{\bar{\mathcal{S}}} \delta_{s_{\Delta+1}}(s)\, d^{\tilde\pi_b}_{\bar\mu}(\bar s)\, d\bar s$ and $s \mapsto \int_{\mathcal{X}} b(s|x)\, d^{\tilde\pi_b}_{\tilde\mu}(x)\, dx$ define the same distribution over $\mathcal{S}$. Note that the discounted state distribution $d^{\tilde\pi_b}_{\bar\mu}(\bar s)$ is over the $\Delta$th-order MDP's state space, while $d^{\tilde\pi_b}_{\tilde\mu}(x)$ is over the DMDP's state space. This yields
$$B \le L_\pi (1+L_P) \int_{\mathcal{X}} \int_{\mathcal{S} \times \mathcal{S}} d_{\mathcal{S}}(s, s')\, b(s|x)\, b(s'|x)\, ds\, ds'\; d^{\tilde\pi_b}_{\tilde\mu}(x)\, dx = L_\pi (1+L_P)\, \mathbb{E}_{\substack{x \sim d^{\tilde\pi_b}_{\tilde\mu}\\ s,s' \sim b(\cdot|x)}}\!\left[d_{\mathcal{S}}(s,s')\right] = L_\pi (1+L_P)\, \sigma_b^{\tilde\mu}.$$
One can now resume at Equation (24) and, recalling that $\gamma L_P(1+L_\pi) \le 1$, one gets
$$\mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right) \le \gamma L_P(1+L_\pi)\, \mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right) + \gamma L_\pi(1+L_P)\, \sigma_b^{\tilde\mu} \;\Longrightarrow\; \mathcal{W}_1\!\left(d^{\pi_E}_{\bar\mu} \,\|\, d^{\tilde\pi_b}_{\bar\mu}\right) \le \frac{\gamma L_\pi(1+L_P)}{1 - \gamma L_P(1+L_\pi)}\, \sigma_b^{\tilde\mu}.$$
Finally, recalling that $L_Q = \frac{L_r}{1 - \gamma L_P(1+L_\pi)}$ finishes the proof.

The previous results will be useful to prove a bound on the value function.
However, before providing this result, we need a last intermediary result.

Proposition C.1. The $\Delta$th-order MDP $\bar M$ can be reduced to a $\Delta$th-order MDP in which the mean reward is redefined as $\bar r(\bar s, a) := r(s_1, a_1)$. Because it does not depend on $a$, we equivalently write $\bar r(\bar s) := \bar r(\bar s, a)$.

Proof. We derive a proof similar to that of Katsikopoulos & Engelbrecht (2003). Let $V^{\pi}$ be the value function of any stationary Markovian policy $\pi$ on $\bar M$ and $\bar V^{\pi}$ its value function based on the reward $\bar r$. We show that there exists a quantity $I(\bar s)$, independent of $\pi$, such that $\bar V^{\pi}(\bar s) = I(\bar s) + \gamma^{\Delta} V^{\pi}(\bar s)$. Since $I(\bar s)$ does not depend on $\pi$ and $\gamma^{\Delta}$ is a positive constant, the ordering of the policies in terms of value function, and thus of performance, is preserved when using this new reward function. Note that we can express the $t$-th state of $\bar M$ as a tuple of elements of the underlying MDP, $\bar s_t = (s_{t-\Delta}, a_{t-\Delta}, \dots, a_{t-1}, s_t)$, allowing negative indices in the underlying MDP for the first terms, which are fixed by $\bar\mu$. We now rewrite $\bar V^{\pi}$ so as to make $V^{\pi}$ appear:
$$\bar V^{\pi}(\bar s) = \mathbb{E}_{\substack{s_{t+1} \sim p(\cdot|s_t, a_t)\\ a_t \sim \pi(\cdot|\bar s_t)}}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \bar r(\bar s_t, a_t)\right] = \mathbb{E}_{\substack{s_{t+1} \sim p(\cdot|s_t, a_t)\\ a_t \sim \pi(\cdot|\bar s_t)}}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_{t-\Delta}, a_{t-\Delta})\right]$$
$$= \underbrace{\mathbb{E}_{s_{t+1} \sim p(\cdot|s_t, a_t)}\!\left[\sum_{t=0}^{\Delta-1} \gamma^t\, r(s_{t-\Delta}, a_{t-\Delta})\right]}_{I(\bar s)} + \mathbb{E}_{\substack{s_{t+1} \sim p(\cdot|s_t, a_t)\\ a_t \sim \pi(\cdot|\bar s_t)}}\!\left[\sum_{t=\Delta}^{\infty} \gamma^t\, r(s_{t-\Delta}, a_{t-\Delta})\right].$$
Two things have to be noted in the previous equation. First, the term on the left does not depend on $\pi$ anymore, since the actions involved in the reward collection are already contained in $\bar s$: it is our sought-after quantity $I$. Second, the rewards in the term on the right can be interpreted as regular rewards of the $\Delta$th-order MDP, since $\bar R(\bar s_t, a_t) = R(s_t, a_t)$; re-indexing the sum therefore gives
$$\bar V^{\pi}(\bar s) = I(\bar s) + \gamma^{\Delta}\, \mathbb{E}_{\substack{s_{t+1} \sim p(\cdot|s_t, a_t)\\ a_t \sim \pi(\cdot|\bar s_t)}}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right] = I(\bar s) + \gamma^{\Delta}\, V^{\pi}(\bar s).$$
We have found a quantity $I$ linking $\bar V^{\pi}$ and $V^{\pi}$, which proves the statement.

We are now able to prove a bound between delayed and undelayed value functions.

Theorem C.2. Consider an $(L_P, L_r)$-LC MDP $M$ and its $\Delta$th-order MDP counterpart $\bar M$. Let $\pi_E$ be an $L_\pi$-LC undelayed policy and assume that $\gamma L_P(1+L_\pi) \le 1$. Let $\tilde\pi_b$ be a delayed policy as defined in Equation (3). Then, for any $\bar s \in \bar{\mathcal{S}}$,
$$\left|V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s)\right| \le \frac{\gamma}{1-\gamma}\, L_\pi L_Q (1+L_P)\, \sigma_b^{x},$$
where $V^{\pi_E}$ and $V^{\tilde\pi_b}$ are defined on $\bar M$, $\sigma_b^{x} = \mathbb{E}_{x' \sim d^{\tilde\pi_b}_{x},\, s,s' \sim b(\cdot|x')}[d_{\mathcal{S}}(s,s')]$, $x$ is the augmented state contained in $\bar s$, and $d^{\tilde\pi_b}_{x}$ is defined on the DMDP.

Proof. By (Puterman, 2014), the two state value functions can be written as
$$V^{\pi_E}(\bar s) = \frac{1}{1-\gamma} \int_{\bar{\mathcal{S}}} \int_{\mathcal{A}} r(\bar s', a)\, \pi_E(a|\bar s')\, d^{\pi_E}_{\bar s}(\bar s')\, da\, d\bar s', \qquad V^{\tilde\pi_b}(\bar s) = \frac{1}{1-\gamma} \int_{\bar{\mathcal{S}}} \int_{\mathcal{A}} r(\bar s', a)\, \tilde\pi_b(a|\bar s')\, d^{\tilde\pi_b}_{\bar s}(\bar s')\, da\, d\bar s'.$$
Writing their difference gives
$$V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s) = \frac{1}{1-\gamma} \int_{\bar{\mathcal{S}}} \int_{\mathcal{A}} r(\bar s', a) \big(\pi_E(a|\bar s')\, d^{\pi_E}_{\bar s}(\bar s') - \tilde\pi_b(a|\bar s')\, d^{\tilde\pi_b}_{\bar s}(\bar s')\big)\, da\, d\bar s'.$$
We now use Proposition C.1 to remove the integral over the action space:
$$V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s) = \frac{1}{1-\gamma} \int_{\bar{\mathcal{S}}} \int_{\mathcal{A}} \bar r(\bar s') \big(\pi_E(a|\bar s')\, d^{\pi_E}_{\bar s}(\bar s') - \tilde\pi_b(a|\bar s')\, d^{\tilde\pi_b}_{\bar s}(\bar s')\big)\, da\, d\bar s' = \frac{1}{1-\gamma} \int_{\bar{\mathcal{S}}} \bar r(\bar s') \big(d^{\pi_E}_{\bar s}(\bar s') - d^{\tilde\pi_b}_{\bar s}(\bar s')\big)\, d\bar s'.$$
One can now use the fact that $\bar r$ is $L_r$-LC on $\bar{\mathcal{S}}$, because $r$ is $L_r$-LC on $\mathcal{S} \times \mathcal{A}$, to get
$$\left|V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s)\right| \le \frac{L_r}{1-\gamma}\, \mathcal{W}_1\!\left(d^{\pi_E}_{\bar s} \,\|\, d^{\tilde\pi_b}_{\bar s}\right).$$
Applying Theorem C.1 yields
$$\left|V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s)\right| \le \frac{\gamma}{1-\gamma}\, L_\pi L_Q (1+L_P)\, \sigma_b^{x},$$
where $x$ is the augmented state contained in $\bar s$. This concludes the proof.

As for Theorem 6.1, we can now use additional assumptions to bound $\sigma_b^{x}$.

Corollary C.1. Under the conditions of Theorem C.2, adding that $\mathcal{S} \subset \mathbb{R}^n$ is equipped with the Euclidean norm, let $\tilde\pi_b$ be a $\Delta$-delayed policy as defined in Equation (3). Then, for any $\bar s \in \bar{\mathcal{S}}$,
$$\left|V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s)\right| \le \sqrt{2}\, \frac{\gamma}{1-\gamma}\, L_\pi L_Q (1+L_P)\, \mathbb{E}_{x' \sim d^{\tilde\pi_b}_{x}}\!\left[\sqrt{\operatorname{Var}_{s \sim b(\cdot|x')}(s|x')}\right].$$
Proof. The result follows from the application of Lemma A.1 to Theorem C.2.

Corollary C.2. Under the conditions of Theorem C.2, adding that the MDP is $L_T$-TLC, let $\tilde\pi_b$ be a $\Delta$-delayed policy as defined in Equation (3). Then
$$\left\|V^{\pi_E} - V^{\tilde\pi_b}\right\|_{\infty} \le 2\Delta\, \frac{\gamma}{1-\gamma}\, L_T L_Q L_\pi (1+L_P),$$
where $V^{\pi_E}$ and $V^{\tilde\pi_b}$ are defined on $\bar{\mathcal{S}}$.

Proof. First, one applies Lemma A.2 to Theorem C.2 to obtain
$$\left|V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s)\right| \le 2\Delta\, \frac{\gamma}{1-\gamma}\, L_T L_Q L_\pi (1+L_P)$$
for any $\bar s \in \bar{\mathcal{S}}$. Taking the supremum over $\bar{\mathcal{S}}$ then gives the result, since the right-hand side does not depend on $\bar s$.

C.1. Comparison of the Two Bounds

As stated in Section 6, there are several choices for the quantities to consider when comparing undelayed to delayed performance. In this paper, Theorem C.2 provides a bound on the space of the $\Delta$th-order MDP ($\bar{\mathcal{S}}$), while Theorem 6.1 compares a value function on the space of the augmented MDP ($\mathcal{X}$) to a value function on the classic state space ($\mathcal{S}$). Recall the bound of Theorem C.2, for $\bar s \in \bar{\mathcal{S}}$,
$$\left|V^{\pi_E}(\bar s) - V^{\tilde\pi_b}(\bar s)\right| \le \frac{\gamma}{1-\gamma}\, L_\pi L_Q (1+L_P)\, \sigma_b^{x}, \tag{25}$$
and the bound of Theorem 6.1, for $x \in \mathcal{X}$,
$$\mathbb{E}_{s \sim b(\cdot|x)}\!\left[V^{\pi_E}(s)\right] - V^{\tilde\pi_b}(x) \le \frac{L_Q L_\pi}{1-\gamma}\, \sigma_b^{x}, \tag{26}$$
where $\sigma_b^{x} = \mathbb{E}_{x' \sim d^{\tilde\pi_b}_{x},\, s,s' \sim b(\cdot|x')}[d_{\mathcal{S}}(s,s')]$. As opposed to what the notation may suggest, the Lipschitz constant $L_Q$ does not have exactly the same value in the two bounds. In Equation (25), we recognized in the proof the constant $L_Q$ of the $Q$-function as given in Rachelson & Lagoudakis (2010), under the assumption that $\gamma L_P(1+L_\pi) \le 1$. In Equation (26), however, we only assume that there exists some constant $L_Q$ for which the $Q$-function is $L_Q$-LC; this includes the case $\gamma L_P(1+L_\pi) \le 1$ but is a more general result. The other difference between the bounds lies in the factor $\gamma(1+L_P)$ of Equation (25). Depending on the task, this factor may be smaller or greater than 1, changing the order of the bounds.

D. DIDA for Non-Integer Delays

We provide a modification of Algorithm 1 that accounts for non-integer delays in Algorithm 2. We consider a delay $\Delta \in \mathbb{R}_{\ge 0}$ and write $\lfloor\Delta\rfloor$ for its integer part and $\{\Delta\} := \Delta - \lfloor\Delta\rfloor$ for its fractional part. We consider a continuous-time MDP which yields both $M_\Delta$ and $M_{\{\Delta\}}$.

Algorithm 2 DIDA for non-integer delays ($\Delta \in \mathbb{R}_{\ge 0}$)
Inputs: $\lfloor\Delta\rfloor$, $\{\Delta\} := \Delta - \lfloor\Delta\rfloor$, continuous MDP yielding $M_\Delta$ and $M_{\{\Delta\}}$, undelayed expert $\pi_E$ trained on $M_{\{\Delta\}}$, $\beta$ routine, number of steps $N$, empty dataset $\mathcal{D}$.
Output: delayed policy $\pi_I$.
1: for $\beta_i$ in $\beta$-routine do
2:   for $j$ in $\{1, \dots, N\}$ do
3:     if new episode then
4:       Initialize the state buffer $(s_1, s_{1+\{\Delta\}}, s_2, \dots, s_{\lceil\Delta\rceil}, s_{\Delta+1})$ and the action buffer $(a_1, a_2, \dots, a_{\lceil\Delta\rceil})$
5:     end if
6:     Sample $a_E \sim \pi_E(\cdot|s_{\Delta+1})$, set $a \leftarrow a_E$
7:     if random $u \sim \mathcal{U}([0,1]) \ge \beta_i$ then
8:       Overwrite $a \sim \pi_I(\cdot|[s_1, a_1, \dots, a_{\lceil\Delta\rceil}])$
9:     end if
10:    Aggregate dataset: $\mathcal{D} \leftarrow \mathcal{D} \cup \{([s_1, a_1, a_2, \dots, a_{\lceil\Delta\rceil}], a_E)\}$
11:    Apply $a$ at $s_{\Delta+1}$ and get the new states $(s_{\lfloor\Delta\rfloor+2}, s_{\Delta+2})$
12:    Update buffers: $(s_1, s_{1+\{\Delta\}}, s_2, \dots, s_{\lceil\Delta\rceil}, s_{\Delta+1}) \leftarrow (s_2, s_{2+\{\Delta\}}, s_3, \dots, s_{\lceil\Delta\rceil+1}, s_{\Delta+2})$ and $(a_1, \dots, a_{\lceil\Delta\rceil}) \leftarrow (a_2, \dots, a_{\lceil\Delta\rceil}, a)$
13:   end for
14:   Train $\pi_I$ on $\mathcal{D}$
15: end for

E. Experimental Details

E.1. Imitation Loss and DIDA's Policy

As stated in the main paper, the function learnt by DIDA can drift from Equation (3) depending on the class of policies of $\pi_I$ and on the loss function used for the imitation step. We derive the policy learnt by DIDA in the two following cases.

Mean squared error loss. DIDA is trained on
$$\int_{\mathcal{S}} \int_{\mathcal{A}} \big(a - \tilde\pi_\theta(x)\big)^2\, \pi_E(a|s)\, b(s|x)\, da\, ds,$$
which is minimized for $\theta$ such that
$$\tilde\pi_\theta(x) = \int_{\mathcal{S}} \mathbb{E}_{a \sim \pi_E(\cdot|s)}[a]\, b(s|x)\, ds.$$
That means that the policy learnt by DIDA outputs the mean value of the action given the belief and the expert policy distribution.
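To make the imitation step concrete, the following sketch shows one DAgger-style DIDA iteration for an integer delay, combining the $\beta$-mixture data collection with the mean-squared-error fit discussed above. The interfaces are hypothetical placeholders of our own (`env.reset`/`env.step` in the gym style, `env.action_dim`, a callable `expert` returning the undelayed action), and `MLPRegressor` merely stands in for the feed-forward policy of Table 1; this is a sketch under these assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the feed-forward policy

def dida_iteration(env, expert, learner, dataset, beta, n_steps, delay):
    """One DAgger-style DIDA iteration for an integer delay (schematic sketch)."""
    s = env.reset()
    state_buf = [s] * (delay + 1)                     # s_{t-delay}, ..., s_t
    action_buf = [np.zeros(env.action_dim)] * delay   # a_{t-delay}, ..., a_{t-1}
    for _ in range(n_steps):
        x = np.concatenate([state_buf[0], *action_buf])   # augmented state
        a_expert = expert(state_buf[-1])                  # expert queried on the true current state
        if learner is None or np.random.rand() < beta:
            a = a_expert                                  # follow the expert with probability beta
        else:
            a = learner.predict(x[None])[0]               # otherwise follow the current learner
        dataset.append((x, a_expert))                     # aggregate (augmented state, expert action)
        s, _, done, _ = env.step(a)
        state_buf = (state_buf + [s])[1:]                 # keep the last delay+1 states
        action_buf = (action_buf + [a])[1:]               # keep the last delay actions
        if done:
            s = env.reset()
            state_buf = [s] * (delay + 1)
            action_buf = [np.zeros(env.action_dim)] * delay
    # MSE fit: the learner regresses the expert action onto the augmented state,
    # i.e., it approximates the belief-averaged expert action of Equation (3).
    X, y = map(np.stack, zip(*dataset))
    learner = MLPRegressor(hidden_layer_sizes=(100, 100, 10), learning_rate_init=1e-3).fit(X, y)
    return learner, dataset
```

The key point is that the regression targets are always the expert's actions at the true current state, while the inputs are the augmented states that the delayed agent itself visits, which is what lets the imitation loss be evaluated under the learner's own distribution.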
Kullback–Leibler loss. DIDA is trained on
$$\int_{\mathcal{S}} D_{\mathrm{KL}}\!\left(\pi_E(\cdot|s) \,\|\, \tilde\pi_\theta(\cdot|x)\right) b(s|x)\, ds = \int_{\mathcal{S}} \int_{\mathcal{A}} \pi_E(a|s) \log\!\big(\pi_E(a|s)\big)\, b(s|x)\, da\, ds - \int_{\mathcal{S}} \int_{\mathcal{A}} \pi_E(a|s) \log\!\big(\tilde\pi_\theta(a|x)\big)\, b(s|x)\, da\, ds.$$
Hence
$$\arg\min_\theta \int_{\mathcal{S}} D_{\mathrm{KL}}\!\left(\pi_E(\cdot|s) \,\|\, \tilde\pi_\theta(\cdot|x)\right) b(s|x)\, ds = \arg\min_\theta\, - \int_{\mathcal{S}} \int_{\mathcal{A}} \pi_E(a|s) \log\!\big(\tilde\pi_\theta(a|x)\big)\, b(s|x)\, da\, ds \tag{27}$$
$$= \arg\min_\theta\, - \int_{\mathcal{A}} \int_{\mathcal{S}} \pi_E(a|s)\, b(s|x) \log\!\big(\tilde\pi_\theta(a|x)\big)\, ds\, da \tag{28}$$
$$= \arg\min_\theta \int_{\mathcal{A}} \Big(\int_{\mathcal{S}} \pi_E(a|s)\, b(s|x)\, ds\Big) \log\!\Big(\int_{\mathcal{S}} b(s|x)\, \pi_E(a|s)\, ds\Big)\, da - \int_{\mathcal{A}} \Big(\int_{\mathcal{S}} \pi_E(a|s)\, b(s|x)\, ds\Big) \log\!\big(\tilde\pi_\theta(a|x)\big)\, da$$
$$= \arg\min_\theta D_{\mathrm{KL}}\!\left(\int_{\mathcal{S}} \pi_E(\cdot|s)\, b(s|x)\, ds \,\Big\|\, \tilde\pi_\theta(\cdot|x)\right),$$
where Equation (27) holds because the first integral does not depend on $\theta$, Equation (28) holds by Fubini's theorem since the functions inside the integral are always negative, and the third equality holds because the added term does not depend on $\theta$. In other words, the KL loss also targets the belief-averaged expert policy of Equation (3).

E.2. Hyper-parameters

For Pendulum, the test performances are obtained by interacting for 1000 steps with the environment, with a maximum episode length of 200. For the MuJoCo environments, the number of steps is 1000 as well, but the maximum episode length is 500. The remaining hyper-parameters are given for each approach and each environment in the following tables.

Hyper-parameter | Pendulum | Mujoco | Trading
Policy type | Feed-forward | Feed-forward | Extra Trees
Iterations | 245 | 400 | 10
Steps per iteration | 10,000 | 10,000 | 2 years of data
β sequence | β1 = 1, βi≥2 = 0 | β1 = 1, βi≥2 = 0 | β1 = 1, βi≥2 = 0
Max buffer size | 10 iterations | 10 iterations | Unlimited
Policy neurons | [100, 100, 10] | [100, 100, 100, 10] | —
Activations | ReLU (Nair & Hinton, 2010) | ReLU | —
Optimizer | Adam (β1 = 0.9, β2 = 0.999) (Kingma & Ba, 2014) | Adam (β1 = 0.9, β2 = 0.999) | —
Learning rate | 1e-3 | 1e-3 | —
Batch size | 64 | 64 | —
Min samples split | — | — | 100
n estimators | — | — | 100
Table 1: Hyper-parameters for DIDA.

Hyper-parameter | Pendulum | Mujoco
Epochs | 2000 | 1000
Steps per epoch | 5000 | 5000
Pre-training epochs | 25 | 25
Pre-training steps | 10000 | 10000
Backtracking line search iterations | 10 | 10
Backtracking line search step | 0.8 | 0.8
Conjugate gradient iterations | 10 | 10
Discount γ | 0.99 | 0.99
Max KL divergence | 0.001 | 0.001
λ for GAE (Schulman et al., 2015b) | 0.97 | 0.97
Value function neurons | [64] | [64]
Value function iterations | 3 | 3
Value function learning rate | 0.01 | 0.01
Policy neurons | [64, 64] | [64, 64]
Activations | ReLU | ReLU
Encoder feed-forward neurons | [8] | [8]
Encoder learning rate | 0.01 | 0.01
Encoder iterations | 2 | 2
Encoder dimension | 64 | 64
Encoder heads | 2 | 2
Encoder optimizer | Adam (β1 = 0.9, β2 = 0.999) | Adam (β1 = 0.9, β2 = 0.999)
Layers of MAF (Papamakarios et al., 2017) | 5 | 5
MAF neurons | [16] | [16]
MAF learning rate | 0.01 | 0.01
MAF optimizer | Adam (β1 = 0.9, β2 = 0.999) | Adam (β1 = 0.9, β2 = 0.999)
Epochs of training the belief | 200 | 200
Batch size belief learning | 10000 | 1000
Prediction buffer size | 100000 | 100000
Belief representation dimension | 8 | 32
Table 2: Hyper-parameters for D-TRPO.

Hyper-parameter | Pendulum
Discount γ | 0.99
Initial replay size | 64
Buffer size | 50000
Batch size | 64
Actor µ neurons | 256
Actor σ neurons | 256
Actor optimizer | Adam (β1 = 0.9, β2 = 0.999)
Warmup transitions | 100
Polyak update τ | 0.005
Entropy learning rate | 3e-4
Train frequency | 50
Table 3: Hyper-parameters for M-SAC and A-SAC.

Hyper-parameter | Pendulum | Mujoco
Epochs | 2000 | 1000
Steps per epoch | 5000 | 5000
Pre-training epochs | 2 | 2
Pre-training steps | 10000 | 10000
Backtracking line search iterations | 10 | 10
Backtracking line search step | 0.8 | 0.8
Conjugate gradient iterations | 10 | 10
Discount γ | 0.99 | 0.99
Max KL divergence | 0.001 | 0.001
λ for GAE | 0.97 | 0.97
Value function neurons | [64] | [64]
Value function iterations | 3 | 3
Value function learning rate | 0.01 | 0.01
Policy neurons | [64, 64] | [64, 64]
Activations | ReLU | ReLU
Encoder feed-forward neurons | [8] | [8]
Encoder learning rate | 5e-3 | 5e-3
Encoder iterations | 1 | 1
Encoder dimension | 64 | 64
Encoder heads | 2 | 2
Encoder optimizer | Adam (β1 = 0.9, β2 = 0.999) | Adam (β1 = 0.9, β2 = 0.999)
Batch size belief learning | 10000 | 1000
Prediction buffer size | 100000 | 100000
Belief representation dimension | 8 | 32
Table 4: Hyper-parameters for L2-TRPO.
Hyper-parameter | Pendulum | Mujoco
Epochs | 2000 | 1000
Steps per epoch | 5000 | 5000
Pre-training epochs | 2 | 2
Pre-training steps | 10000 | 10000
Backtracking line search iterations | 10 | 10
Backtracking line search step | 0.8 | 0.8
Conjugate gradient iterations | 10 | 10
Discount γ | 0.99 | 0.99
Max KL divergence | 0.001 | 0.001
λ for GAE | 0.97 | 0.97
Value function neurons | [64] | [64]
Value function iterations | 3 | 3
Value function learning rate | 0.01 | 0.01
Policy neurons | [64, 64] | [64, 64]
Activations | ReLU | ReLU
Table 5: Hyper-parameters for M-TRPO and A-TRPO.

Hyper-parameter | Pendulum
Epochs | 2000
Steps per epoch | 5000
Discount γ | 0.99
Eligibility trace λ | 0.9
Learning rate | 0.1
ϵ-greedy parameter | 0.2
S discretization size | 15
A discretization size | 3
Table 6: Hyper-parameters for SARSA.

Hyper-parameter | Pendulum
Epochs | 2000
Steps per epoch | 5000
Discount γ | 0.99
Eligibility trace λ | 0.9
Learning rate | 0.1
ϵ-greedy parameter | 0.2
S discretization size | 15
A discretization size | 3
Table 7: Hyper-parameters for dSARSA.

E.3. Further Experiments

Stochastic Pendulum. In this experiment, we evaluate DIDA on a stochastic environment. Following Liotet et al. (2021), we add stochasticity to the Pendulum environment by perturbing the action selected by the agent with an i.i.d. noise of the form ϵ = scale · (η − shift), where η is sampled from some probability distribution. We construct 6 such noises, reported in Table 8, and group them by similarity for readability of the plots. We build an additional noise, referred to as uniform noise, which follows the action of the agent with probability 0.9 and otherwise samples an action uniformly at random inside the action space; we place this noise in group 2. The results are obtained with the hyper-parameters given in Appendix E.2 for the Pendulum environment.

Noise | Distribution η | Shift | Scale | Group
Beta (8,2) | β(8, 2) | 0.5 | 2 | 1
Beta (2,2) | β(2, 2) | 0.5 | 2 | 1
U-Shaped | β(0.5, 0.5) | 0.5 | 2 | 1
Triangular | Triangular(−2, 1, 2) | 0 | 1 | 2
Lognormal (1) | Lognormal(0, 1) | 1 | 1 | 3
Lognormal (0.1) | Lognormal(0, 0.1) | 1 | 1 | 3
Table 8: Distributions of the noise added to the action in the stochastic Pendulum.

Figure 6: Mean return and one standard deviation (shaded) as a function of the number of steps sampled from the environment for the stochastic Pendulum with different noises (10 seeds). (a) Noises of group 1. (b) Noises of group 2. (c) Noises of group 3.
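For concreteness, a minimal sketch of such an action-noise perturbation is given below. The wrapper interface, the clipping to the valid action range, and the environment name are our own assumptions rather than the authors' code; only the distribution parameters mirror Table 8.

```python
import numpy as np
import gym

class NoisyActionWrapper(gym.ActionWrapper):
    """Perturb the agent's action with eps = scale * (eta - shift) before stepping (sketch)."""
    def __init__(self, env, sampler, shift, scale):
        super().__init__(env)
        self.sampler, self.shift, self.scale = sampler, shift, scale

    def action(self, a):
        eps = self.scale * (self.sampler() - self.shift)
        # Clipping to the action range is our own choice, not stated in the paper.
        return np.clip(a + eps, self.action_space.low, self.action_space.high)

rng = np.random.default_rng(0)
# The six noises of Table 8, written as (sampler, shift, scale) triples.
noises = {
    "Beta(8,2)":      (lambda: rng.beta(8, 2),           0.5, 2.0),
    "Beta(2,2)":      (lambda: rng.beta(2, 2),           0.5, 2.0),
    "U-Shaped":       (lambda: rng.beta(0.5, 0.5),       0.5, 2.0),
    "Triangular":     (lambda: rng.triangular(-2, 1, 2), 0.0, 1.0),
    "Lognormal(1)":   (lambda: rng.lognormal(0.0, 1.0),  1.0, 1.0),
    "Lognormal(0.1)": (lambda: rng.lognormal(0.0, 0.1),  1.0, 1.0),
}
sampler, shift, scale = noises["Beta(2,2)"]
env = NoisyActionWrapper(gym.make("Pendulum-v1"), sampler, shift, scale)
```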
Group 1: In all these cases, we consider noises based on beta distributions with different parameters. We can see in Figure 6a that our algorithm achieves a much better performance than the baselines, even with a fraction of the training samples. For every algorithm, the most favourable case seems to be the second one, based on a Beta(2,2) noise. This may be explained by the features of the other two noises: the first, Beta(8,2), yields a non-zero-mean noise, so that the action is affected, on average, by a translation in one direction; the third, based on a Beta(0.5,0.5) distribution, is zero-mean but has a higher variance than the second (0.5 vs 0.2).

Group 2: Here again, we see in Figure 6b that DIDA obtains the best performance, even if the two baselines seem to learn much faster than for group 1. This suggests that group 2 contains easier tasks, even though the triangular noise is not symmetric. Still, note that even if the probability of a random action under the uniform noise is small, it accumulates under delay: the probability of having at least one random action inside an action queue of length 5 is $1 - 0.9^{5} \approx 0.41$. Nonetheless, DIDA seems to deal very well with this situation.

Group 3: Being strongly asymmetric and unbounded, these noises pose more of a challenge to the algorithms. We report the results in Figure 6c. For the Lognormal(1) noise, no algorithm reaches a satisfactory performance. However, once again, DIDA obtains the best performance for each noise.