# Lipschitz Lifelong Reinforcement Learning

Erwan Lecarpentier¹,², David Abel³, Kavosh Asadi³,⁴, Yuu Jinnai³, Emmanuel Rachelson¹, Michael L. Littman³
¹ISAE-SUPAERO, Université de Toulouse, France
²ONERA, The French Aerospace Lab, Toulouse, France
³Brown University, Providence, Rhode Island, USA
⁴Amazon Web Services, Palo Alto, California, USA (Kavosh Asadi finished working on this project before joining Amazon.)
erwanlecarpentier@mailbox.org

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the task space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show that the method experiences no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments.

1 Introduction

Lifelong Reinforcement Learning (RL) is an online problem where an agent faces a series of RL tasks, drawn sequentially. Transferring knowledge from prior experience to speed up the resolution of new tasks is a key question in that setting (Lazaric 2012; Taylor and Stone 2009). We elaborate on the intuitive idea that similar tasks should allow a large amount of transfer. An agent able to compute online a similarity measure between source tasks and the current target task would be able to perform transfer accordingly. By measuring the amount of inter-task similarity, we design a novel method for value transfer that is practically deployable in the online Lifelong RL setting. Specifically, we introduce a metric between MDPs and prove that the optimal Q-value function is Lipschitz continuous with respect to the MDP space. This property makes it possible to compute a provable upper bound on the optimal Q-value function of an unknown target task, given the learned optimal Q-value function of a source task. Knowing this upper bound accelerates the convergence of an RMax-like algorithm (Brafman and Tennenholtz 2002), which relies on an optimistic estimate of the optimal Q-value function. Overall, the proposed transfer method consists of computing online the distance between source and target tasks, deducing the upper bound on the optimal Q-value function of the target task, and using this bound to accelerate learning. Importantly, the method exhibits no negative transfer, i.e., it cannot cause performance degradation, as the computed upper bound provably does not underestimate the optimal Q-value function.

Our contributions are as follows. First, we study theoretically the Lipschitz continuity of the optimal Q-value function in the task space by introducing a metric between MDPs (Section 3). Then, we use this continuity property to propose a value-transfer method based on a local distance between MDPs (Section 4). Full knowledge of both MDPs is not required and the transfer is non-negative, which makes the method applicable online and safe. In Section 4.3, we build a PAC-MDP algorithm called Lipschitz RMax, applying this transfer method in the online Lifelong RL setting.
We provide sample and computational complexity bounds and showcase the algorithm in Lifelong RL experiments (Section 5).

2 Background and Related Work

Reinforcement Learning (RL) (Sutton and Barto 2018) is a framework for sequential decision making. The problem is typically modeled as a Markov Decision Process (MDP) (Puterman 2014) consisting of a 4-tuple $\langle S, A, R, T \rangle$, where $S$ is a state space, $A$ an action space, $R^a_s$ the expected reward of taking action $a$ in state $s$, and $T^a_{ss'}$ the transition probability of reaching state $s'$ when taking action $a$ in state $s$. Without loss of generality, we assume $R^a_s \in [0, 1]$. Given a discount factor $\gamma \in [0, 1)$, the expected cumulative return $\mathbb{E}\left[\sum_t \gamma^t R^{a_t}_{s_t}\right]$ obtained along a trajectory starting with state $s$ and action $a$ using policy $\pi$ in MDP $M$ is denoted by $Q^\pi_M(s, a)$ and called the Q-function. The optimal Q-function $Q^*_M$ is the highest attainable expected return from $(s, a)$, and $V^*_M(s) = \max_{a \in A} Q^*_M(s, a)$ is the optimal value function in $s$. Notice that $R^a_s \le 1$ implies $Q^*_M(s, a) \le \frac{1}{1-\gamma}$ for all $(s, a) \in S \times A$. This maximum upper bound is used by the RMax algorithm as an optimistic initialization of the learned Q-function. A key point to reduce the sample complexity of this algorithm is to benefit from a tighter upper bound, which is the purpose of our transfer method.

Lifelong RL (Silver, Yang, and Li 2013; Brunskill and Li 2014) is the problem of experiencing online a series of MDPs drawn from an unknown distribution. Each time an MDP is sampled, a classical RL problem takes place where the agent is able to interact with the environment to maximize its expected return. In this setting, it is reasonable to think that knowledge gained on previous MDPs can be re-used to improve the performance in new MDPs. In this paper, we provide a novel method for such transfer by characterizing the way the optimal Q-function can evolve across tasks. As commonly done (Wilson et al. 2007; Brunskill and Li 2014; Abel et al. 2018), we restrict the scope of the study to the case where sampled MDPs share the same state-action space $S \times A$. For brevity, we will refer indifferently to MDPs, models or tasks, and write them $M = \langle R, T \rangle$.

Using a metric between MDPs has the appealing characteristic of quantifying the amount of similarity between tasks, which intuitively should be linked to the amount of transfer achievable. Song et al. (2016) define a metric based on the bisimulation metric introduced by Ferns, Panangaden, and Precup (2004) and the Wasserstein metric (Villani 2008). Value transfer is performed between states with low bisimulation distances. However, this metric requires knowing both MDPs completely and is thus unusable in the Lifelong RL setting, where we expect to perform transfer before having learned the current MDP. Further, the transfer technique they propose does allow negative transfer (see Appendix, Section 1). Carroll and Seppi (2005) also define a value-transfer method based on a measure of similarity between tasks. However, this measure is not computable online and thus not applicable to the Lifelong RL setting. Mahmud et al. (2013) and Brunskill and Li (2013) propose MDP clustering methods, respectively using a metric quantifying the regret of running the optimal policy of one MDP in the other MDP, and the L1 norm between the MDP models. An advantage of clustering is to prune the set of possible source tasks.
They use their approach for policy transfer, which differs from the value-transfer method proposed in this paper. Ammar et al. (2014) learn the model of a source MDP and view the prediction error on a target MDP as a dissimilarity measure in the task space. Their method makes use of samples from both tasks and is not readily applicable to the online setting considered in this paper. Lazaric, Restelli, and Bonarini (2008) provide a practical method for sample transfer, computing a similarity metric reflecting the probability of the models being identical. Their approach is applicable in a batch RL setting, as opposed to the online setting considered in this paper. The approach developed by Sorg and Singh (2009) is very similar to ours in the sense that they prove bounds on the optimal Q-function for new tasks, assuming that both MDPs are known and that a soft homomorphism exists between the state spaces. Brunskill and Li (2013) also provide a method that can be used for Q-function bounding in multi-task RL. Abel et al. (2018) present the MaxQInit algorithm, providing transferable bounds on the Q-function with high probability while preserving PAC-MDP guarantees (Strehl, Li, and Littman 2009). Given a set of solved tasks, they derive the probability that the maximum over the Q-values of previous MDPs is an upper bound on the current task's optimal Q-function. This approach results in a method for non-negative transfer with high probability once enough tasks have been sampled.

The method developed by Abel et al. (2018) is similar to ours on two fundamental points: first, a theoretical upper bound on optimal Q-values across the MDP space is built; secondly, this provable upper bound is used to transfer knowledge between MDPs by replacing the maximum $\frac{1}{1-\gamma}$ bound in an RMax-like algorithm, providing PAC guarantees.

[Figure 1: The optimal Q-value function represented for a particular $(s, a)$ pair across the MDP space. The RMax, MaxQInit and LRMax bounds are represented for three sampled MDPs.]

The difference between the two approaches is illustrated in Figure 1, where the MaxQInit bound is the one developed by Abel et al. (2018) and the LRMax bound is the one we present in this paper. In this figure, the essence of the LRMax bound is noticeable. It stems from the fact that the optimal Q-value function is locally Lipschitz continuous in the MDP space w.r.t. a specific pseudometric. Confirming the intuition, close MDPs w.r.t. this metric have close optimal Q-values. It should be noticed that no bound is uniformly better than the other, as intuited by Figure 1. Hence, combining all the bounds results in a tighter upper bound, as we will illustrate in experiments (Section 5). We first carry out the theoretical characterization of the Lipschitz continuity properties in the following section. Then, we build on this result to propose a practical transfer method for the online Lifelong RL setting.

3 Lipschitz Continuity of Q-Functions

The intuition we build on is that similar MDPs should have similar optimal Q-functions. Formally, this insight can be translated into a continuity property of the optimal Q-function over the MDP space $\mathbb{M}$. The remainder of this section mathematically formalizes this intuition, which will be used in the next section to derive a practical method for value transfer. To that end, we introduce a local pseudometric characterizing the distance between the models of two MDPs at a particular state-action pair.
A reminder and a detailed discussion on the metrics used herein can be found in the Appendix, Section 2.

Definition 1. Given two tasks $M = \langle R, T \rangle$ and $\bar{M} = \langle \bar{R}, \bar{T} \rangle$, and a function $f : S \to \mathbb{R}_+$, we define the pseudometric between models at $(s, a) \in S \times A$ w.r.t. $f$ as:
$$D^f_{sa}(M, \bar{M}) \triangleq |R^a_s - \bar{R}^a_s| + \sum_{s' \in S} f(s') \, |T^a_{ss'} - \bar{T}^a_{ss'}|. \quad (1)$$

This pseudometric is relative to a positive function $f$. We implicitly cast this definition in the context of discrete state spaces. The extension to continuous spaces is straightforward but beyond the scope of this paper. For the sake of clarity in the remainder of this study, we introduce $D_{sa}(M \Vert \bar{M}) \triangleq D^{\gamma V^*_{\bar{M}}}_{sa}(M, \bar{M})$, corresponding to the pseudometric between models with the particular choice of $f = \gamma V^*_{\bar{M}}$. From this definition stems the following pseudo-Lipschitz continuity result.

Proposition 1 (Local pseudo-Lipschitz continuity). For two MDPs $M, \bar{M}$, for all $(s, a) \in S \times A$,
$$\left| Q^*_M(s, a) - Q^*_{\bar{M}}(s, a) \right| \le \Delta_{sa}(M, \bar{M}), \quad (2)$$
with the local MDP pseudometric $\Delta_{sa}(M, \bar{M}) \triangleq \min\{ d_{sa}(M \Vert \bar{M}), d_{sa}(\bar{M} \Vert M) \}$, where the local MDP dissimilarity $d_{sa}(M \Vert \bar{M})$ is the unique solution of the following fixed-point equation on $d_{sa}$:
$$d_{sa} = D_{sa}(M \Vert \bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} d_{s'a'}, \quad \forall (s, a) \in S \times A. \quad (3)$$

All the proofs of the paper can be found in the Appendix. This result establishes that the distance between the optimal Q-functions of two MDPs at $(s, a) \in S \times A$ is controlled by a local dissimilarity between the MDPs. The latter obeys a fixed-point equation (Equation 3), which can be solved by Dynamic Programming (DP) (Bellman 1957). Note that, although the local MDP dissimilarity $d_{sa}(M \Vert \bar{M})$ is asymmetric, $\Delta_{sa}(M, \bar{M})$ is a pseudometric, hence the name pseudo-Lipschitz continuity. Notice that the policies in Equation 2 are the optimal ones for the two MDPs and thus are different. Proposition 1 is a mathematical result stemming from Definition 1 and should be distinguished from other frameworks of the literature that assume the continuity of the reward and transition models w.r.t. $S \times A$ (Rachelson and Lagoudakis 2010; Pirotta, Restelli, and Bascetta 2015; Asadi, Misra, and Littman 2018). This result establishes that the optimal Q-functions of two close MDPs, in the sense of Equation 1, are themselves close to each other. Hence, given $Q^*_{\bar{M}}$, the function
$$(s, a) \mapsto Q^*_{\bar{M}}(s, a) + \Delta_{sa}(M, \bar{M}) \quad (4)$$
can be used as an upper bound on $Q^*_M$, with $M$ an unknown MDP. This is the idea on which we construct a computable and transferable upper bound in Section 4. In Figure 1, the upper bound of Equation 4 is represented by the LRMax bound. Noticeably, we provide a global pseudo-Lipschitz continuity property, along with similar results for the optimal value function $V^*_M$ and the value function of a fixed policy. As these results do not directly serve the purpose of this article, we report them in the Appendix, Section 4.
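As an illustration, the following sketch (ours, not the authors' released code; array layouts and function names are illustrative assumptions) computes the model pseudometric of Equation 1 with $f = \gamma V^*_{\bar{M}}$ and solves the fixed-point Equation 3 by dynamic programming, which converges since the backup is a $\gamma$-contraction.

```python
# Sketch of Definition 1 and Equation 3 (illustrative, not the authors' code).
import numpy as np

def model_pseudometric(R, T, R_bar, T_bar, f):
    """D^f_sa(M, Mbar) = |R^a_s - Rbar^a_s| + sum_s' f(s') |T^a_ss' - Tbar^a_ss'|.

    R, R_bar: (S, A) reward arrays; T, T_bar: (S, A, S) transition arrays; f: (S,) array."""
    return np.abs(R - R_bar) + np.abs(T - T_bar) @ f

def local_dissimilarity(R, T, R_bar, T_bar, V_bar_star, gamma, tol=1e-8):
    """Solve d_sa = D_sa(M || Mbar) + gamma * sum_s' T^a_ss' max_a' d_s'a' (Eq. 3)."""
    D = model_pseudometric(R, T, R_bar, T_bar, gamma * V_bar_star)
    d = np.zeros_like(D)
    while True:
        d_new = D + gamma * T @ d.max(axis=1)   # gamma-contraction, hence convergence
        if np.abs(d_new - d).max() < tol:
            return d_new
        d = d_new

# Proposition 1 then bounds |Q*_M - Q*_Mbar| elementwise by
# Delta = np.minimum(local_dissimilarity(R, T, Rb, Tb, V_bar_star, gamma),
#                    local_dissimilarity(Rb, Tb, R, T, V_star, gamma))
```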
4 Transfer Using the Lipschitz Continuity

A purpose of value transfer, when interacting online with a new MDP, is to initialize the value function and drive the exploration to accelerate learning. We aim to exploit value transfer in a method guaranteeing three conditions: C1, the resulting algorithm is PAC-MDP; C2, the transfer accelerates learning; C3, the transfer is non-negative. To achieve these conditions, we first present a transferable upper bound on $Q^*_M$ in Section 4.1. This upper bound stems from the Lipschitz continuity result of Proposition 1. Then, we propose a practical way to compute this upper bound in Section 4.2. Precisely, we propose a surrogate bound that can be calculated online in the Lifelong RL setting, without having explored the source and target tasks completely. Finally, we implement the method in an algorithm described in Section 4.3, and demonstrate formally that it meets conditions C1, C2 and C3. Improvements are discussed in Section 4.4.

4.1 A Transferable Upper Bound on $Q^*_M$

From Proposition 1, one can naturally define a local upper bound on the optimal Q-function of an MDP given the optimal Q-function of another MDP.

Definition 2. Given two tasks $M$ and $\bar{M}$, for all $(s, a) \in S \times A$, the Lipschitz upper bound on $Q^*_M$ induced by $Q^*_{\bar{M}}$ is defined as $U_{\bar{M}}(s, a) \ge Q^*_M(s, a)$ with:
$$U_{\bar{M}}(s, a) \triangleq Q^*_{\bar{M}}(s, a) + \Delta_{sa}(M, \bar{M}). \quad (5)$$

The optimism in the face of uncertainty principle leads to considering that the long-term expected return from any state is the maximum return $\frac{1}{1-\gamma}$, unless proven otherwise. Particularly, the RMax algorithm (Brafman and Tennenholtz 2002) explores an MDP so as to shrink this upper bound. RMax is a model-based, online RL algorithm with PAC-MDP guarantees (Strehl, Li, and Littman 2009), meaning that convergence to a near-optimal policy is guaranteed in a polynomial number of missteps with high probability. It relies on an optimistic model initialization that yields an optimistic upper bound $U$ on the optimal Q-function, then acts greedily w.r.t. $U$. By default, it takes the maximum value $U(s, a) = \frac{1}{1-\gamma}$, but any tighter upper bound is admissible. Thus, shrinking $U$ with Equation 5 is expected to improve the learning speed, or sample complexity, on new tasks in Lifelong RL. In RMax, during the resolution of a task $M$, $S \times A$ is split into a subset of known state-action pairs $K$ and its complement $K^c$ of unknown pairs. A state-action pair is known if the number of collected reward and transition samples allows estimating an $\epsilon$-accurate model in $L_1$-norm with probability higher than $1 - \delta$. We refer to $\epsilon$ and $\delta$ as the RMax precision parameters. This results in a threshold $n_{\text{known}}$ on the number of visits $n(s, a)$ to a pair $(s, a)$ that are necessary to reach this precision. Given the experience of a set of $m$ MDPs $\bar{\mathcal{M}} = \{\bar{M}_1, \dots, \bar{M}_m\}$, we define the total bound as the minimum over all the induced Lipschitz bounds.

Proposition 2. Given a partially known task $M = \langle R, T \rangle$, the set of known state-action pairs $K$, and the set of Lipschitz bounds on $Q^*_M$ induced by previous tasks $\{U_{\bar{M}_1}, \dots, U_{\bar{M}_m}\}$, the function $Q$ defined below is an upper bound on $Q^*_M$ for all $(s, a) \in S \times A$:
$$Q(s, a) = \begin{cases} R^a_s + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} Q(s', a') & \text{if } (s, a) \in K, \\ U(s, a) & \text{otherwise,} \end{cases} \quad (6)$$
with $U(s, a) = \min\left\{ \frac{1}{1-\gamma}, U_{\bar{M}_1}(s, a), \dots, U_{\bar{M}_m}(s, a) \right\}$.

Commonly in RMax, Equation 6 is solved to a precision $\epsilon_Q$ via Value Iteration. This yields a function $Q$ that is a valid heuristic (bound on $Q^*_M$) for the exploration of MDP $M$.
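For concreteness, here is a sketch (ours, based on the reconstructed form of Equation 6; function names and array layouts are illustrative) of the optimistic value iteration used by RMax-like algorithms: known pairs are backed up with the learned model, unknown pairs fall back on the admissible bound $U$, itself the minimum of $\frac{1}{1-\gamma}$ and the Lipschitz bounds induced by previous tasks.

```python
# Sketch of Proposition 2 / Equation 6 (illustrative, not the authors' code).
import numpy as np

def combined_upper_bound(lipschitz_bounds, gamma, shape):
    """U(s, a) = min{ 1/(1-gamma), U_Mbar1(s, a), ..., U_Mbarm(s, a) }."""
    U = np.full(shape, 1.0 / (1.0 - gamma))
    for U_bar in lipschitz_bounds:               # one (S, A) array per previous task
        U = np.minimum(U, U_bar)
    return U

def optimistic_value_iteration(R_hat, T_hat, known, U, gamma, eps_q=1e-3):
    """Solve Eq. 6 to precision eps_q; `known` is a boolean (S, A) mask."""
    Q = np.full(R_hat.shape, 1.0 / (1.0 - gamma))
    while True:
        backup = R_hat + gamma * T_hat @ Q.max(axis=1)
        Q_new = np.where(known, backup, U)       # model backup if known, bound otherwise
        if np.abs(Q_new - Q).max() < eps_q:
            return Q_new
        Q = Q_new
```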
4.2 A Computable Upper Bound on $Q^*_M$

The key issue addressed in this section is how to actually compute $U(s, a)$, particularly when both source and target tasks are partially explored. Consider two tasks $M$ and $\bar{M}$, on which vanilla RMax has been applied, yielding the respective sets of known state-action pairs $K$ and $\bar{K}$, along with the learned models $\hat{M} = \langle \hat{T}, \hat{R} \rangle$ and $\hat{\bar{M}} = \langle \hat{\bar{T}}, \hat{\bar{R}} \rangle$, and the upper bounds $Q$ and $\bar{Q}$ on $Q^*_M$ and $Q^*_{\bar{M}}$ respectively. Notice that, if $K = \emptyset$, then $Q(s, a) = \frac{1}{1-\gamma}$ for all $(s, a)$ pairs. Conversely, if $K^c = \emptyset$, $Q$ is an $\epsilon$-accurate estimate of $Q^*_M$ in $L_1$-norm with high probability. Equation 6 allows the transfer of knowledge from $\bar{M}$ to $M$ if $U_{\bar{M}}(s, a)$ can be computed. Unfortunately, the true models and optimal value functions, necessary to compute $U_{\bar{M}}$ (see Equation 5), are only partially known. Thus, we propose to compute a looser upper bound based on the learned models and value functions. First, we provide an upper bound $\hat{D}_{sa}(M \Vert \bar{M})$ on $D_{sa}(M \Vert \bar{M})$ (Definition 1).

Proposition 3. Given two tasks $M, \bar{M}$ and respectively $K, \bar{K}$ the subsets of $S \times A$ where their models are known with accuracy $\epsilon$ in $L_1$-norm with probability at least $1 - \delta$, we have $\Pr\left( \hat{D}_{sa}(M \Vert \bar{M}) \ge D_{sa}(M \Vert \bar{M}) \right) \ge 1 - \delta$, with $\hat{D}_{sa}(M \Vert \bar{M})$ the upper bound on the pseudometric between models defined below for $B = \epsilon \left( 1 + \gamma \max_{s'} \bar{V}(s') \right)$:
$$\hat{D}_{sa}(M \Vert \bar{M}) \triangleq \begin{cases} D^{\gamma \bar{V}}_{sa}(\hat{M}, \hat{\bar{M}}) + 2B & \text{if } (s, a) \in K \cap \bar{K}, \\ \max_{\mu \in \mathbb{M}} D^{\gamma \bar{V}}_{sa}(\hat{M}, \mu) + B & \text{if } (s, a) \in K \cap \bar{K}^c, \\ \max_{\mu \in \mathbb{M}} D^{\gamma \bar{V}}_{sa}(\mu, \hat{\bar{M}}) + B & \text{if } (s, a) \in K^c \cap \bar{K}, \\ \max_{(\mu, \bar{\mu}) \in \mathbb{M}^2} D^{\gamma \bar{V}}_{sa}(\mu, \bar{\mu}) & \text{if } (s, a) \in K^c \cap \bar{K}^c. \end{cases} \quad (7)$$

Importantly, this upper bound $\hat{D}_{sa}(M \Vert \bar{M})$ can be calculated analytically (see Appendix, Section 7). This makes $\hat{D}_{sa}(M \Vert \bar{M})$ usable in the online Lifelong RL setting, where already explored tasks may be partially learned and little knowledge has been gathered on the current task. The magnitude of the $B$ term is controlled by $\epsilon$. In the case where no information is available on the maximum value of $\bar{V}$, we have $B = \frac{\epsilon}{1-\gamma}$. $\epsilon$ measures the accuracy with which the tasks are known: the smaller $\epsilon$, the tighter the $B$ bound. Note that $\bar{V}$ is used as an upper bound on the true $V^*_{\bar{M}}$. In many cases, $\max_{s'} V^*_{\bar{M}}(s') \ll \frac{1}{1-\gamma}$; e.g., for stochastic shortest path problems, which feature rewards only upon reaching terminal states, we have $\max_{s'} V^*_{\bar{M}}(s') = 1$ and thus $B = (1 + \gamma)\epsilon$ is a tighter bound for transfer. Combining $\hat{D}_{sa}(M \Vert \bar{M})$ and Equation 3, one can derive an upper bound $\hat{d}_{sa}(M \Vert \bar{M})$ on $d_{sa}(M \Vert \bar{M})$, detailed in Proposition 4.

Proposition 4. Given two tasks $M, \bar{M} \in \mathbb{M}$ and $K$ the set of state-action pairs where $(R, T)$ is known with accuracy $\epsilon$ in $L_1$-norm with probability at least $1 - \delta$, if $\gamma(1 + \epsilon) < 1$, the solution $\hat{d}_{sa}(M \Vert \bar{M})$ of the following fixed-point equation on $\hat{d}_{sa}$ (for all $(s, a) \in S \times A$) is an upper bound on $d_{sa}(M \Vert \bar{M})$ with probability at least $1 - \delta$:
$$\hat{d}_{sa} = \hat{D}_{sa}(M \Vert \bar{M}) + \begin{cases} \gamma \left( \sum_{s' \in S} \hat{T}^a_{ss'} \max_{a' \in A} \hat{d}_{s'a'} + \epsilon \max_{(s', a') \in S \times A} \hat{d}_{s'a'} \right) & \text{if } (s, a) \in K, \\ \gamma \max_{(s', a') \in S \times A} \hat{d}_{s'a'} & \text{otherwise.} \end{cases} \quad (8)$$

Similarly to Proposition 3, the condition $\gamma(1 + \epsilon) < 1$ illustrates the fact that for a large return horizon (large $\gamma$), a high accuracy (small $\epsilon$) is needed for the bound to be computable. Eventually, a computable upper bound on $Q^*_M$ given $\bar{M}$, valid with high probability, is given by
$$\hat{U}_{\bar{M}}(s, a) = \bar{Q}(s, a) + \min\left\{ \hat{d}_{sa}(M \Vert \bar{M}), \hat{d}_{sa}(\bar{M} \Vert M) \right\}. \quad (9)$$
The associated upper bound on $U(s, a)$ (Equation 6), given the set of previous tasks $\bar{\mathcal{M}} = \{\bar{M}_i\}_{i=1}^m$, is defined by
$$\hat{U}(s, a) = \min\left\{ \frac{1}{1-\gamma}, \hat{U}_{\bar{M}_1}(s, a), \dots, \hat{U}_{\bar{M}_m}(s, a) \right\}. \quad (10)$$
This upper bound can be used to transfer knowledge from a partially solved source task to a target task. If $\hat{U}(s, a) < \frac{1}{1-\gamma}$ on a subset of $S \times A$, then the convergence rate can be improved. As complete knowledge of both tasks is not needed to compute the upper bound, it can be applied online in the Lifelong RL setting. In the next section, we describe an algorithm that leverages this value-transfer method.
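The sketch below (ours; it simplifies the analytic maxima over candidate models of Proposition 3 into a single worst-case constant `d_worst`, e.g. $\frac{1+\gamma}{1-\gamma}$ or a prior $D_{\max}$, which is an assumption of this sketch) shows how Equations 7 and 8 can be evaluated from partially learned models.

```python
# Sketch of the computable bounds of Propositions 3-4 (illustrative, not the authors' code).
import numpy as np

def d_hat_model(D_learned, known, known_bar, B, d_worst):
    """Conservative version of Eq. 7 as an (S, A) array.

    D_learned: pseudometric between the *learned* models; known, known_bar: boolean
    (S, A) masks; the maxima over candidate MDPs are replaced by d_worst here."""
    both_known = known & known_bar
    one_known = known ^ known_bar                # exactly one of the two models known
    return np.where(both_known, D_learned + 2.0 * B,
           np.where(one_known, d_worst + B, d_worst))

def d_hat_fixed_point(D_hat, T_hat, known, gamma, eps, tol=1e-8):
    """Solve the fixed-point Eq. 8 by DP; requires gamma * (1 + eps) < 1."""
    assert gamma * (1.0 + eps) < 1.0
    d = np.zeros_like(D_hat)
    while True:
        worst = d.max()                          # max over all (s', a') of d_hat
        known_backup = gamma * (T_hat @ d.max(axis=1) + eps * worst)
        d_new = D_hat + np.where(known, known_backup, gamma * worst)
        if np.abs(d_new - d).max() < tol:
            return d_new
        d = d_new

# Eq. 9:  U_hat_Mbar = Q_bar + min(d_hat(M || Mbar), d_hat(Mbar || M))
# Eq. 10: U_hat      = min(1 / (1 - gamma), U_hat_Mbar1, ..., U_hat_Mbarm)
```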
4.3 Lipschitz RMax Algorithm

In Lifelong RL, MDPs are encountered sequentially. Applying RMax to a task $\bar{M}$ yields the set of known state-action pairs $\bar{K}$, the learned models $\hat{\bar{T}}$ and $\hat{\bar{R}}$, and the upper bound $\bar{Q}$ on $Q^*_{\bar{M}}$. Saving this information when the task changes allows computing the upper bound of Equation 10 for the new target task, and using it to shrink the optimistic heuristic of RMax. This computation effectively transfers value functions between tasks based on task similarity. As the new task is explored online, the task similarity is progressively assessed with better confidence, refining the values of $\hat{D}_{sa}(M \Vert \bar{M})$, $\hat{d}_{sa}(M \Vert \bar{M})$ and eventually $\hat{U}$, allowing for more efficient transfer as the task similarity is appraised. The resulting algorithm, Lipschitz RMax (LRMax), is presented in Algorithm 1. To avoid ambiguities, we use $\hat{\mathcal{M}}$ to store the learned features $(\hat{T}, \hat{R}, K, Q)$ of previous MDPs. In a nutshell, the behavior of LRMax is precisely that of RMax, but with a tighter admissible heuristic $\hat{U}$ that becomes better as the new task is explored (while this heuristic remains constant in vanilla RMax). LRMax is PAC-MDP (Condition C1), as stated in Propositions 5 and 6 below. With $S = |S|$ and $A = |A|$, the sample complexity of vanilla RMax is $O\left(S^2 A / (\epsilon^3 (1-\gamma)^3)\right)$, which is improved by LRMax in Proposition 5 and meets Condition C2. Finally, $\hat{U}$ is a provable upper bound on $Q^*_M$ with high probability, which avoids negative transfer and meets Condition C3.

Proposition 5 (Sample complexity (Strehl, Li, and Littman 2009)). With probability $1 - \delta$, the greedy policy w.r.t. the $Q$ computed by LRMax achieves an $\epsilon$-optimal return in MDP $M$ after
$$\frac{S \, \left| \left\{ (s, a) \in S \times A \;\middle|\; \hat{U}(s, a) \ge V^*_M(s) - \epsilon \right\} \right|}{\epsilon^3 (1-\gamma)^3}$$
samples (when logarithmic factors are ignored), with $\hat{U}$ defined in Equation 10 a non-static, decreasing quantity, upper bounded by $\frac{1}{1-\gamma}$.

Algorithm 1: Lipschitz RMax algorithm
  Initialize $\hat{\mathcal{M}} = \emptyset$.
  for each newly sampled MDP $M$ do
    Initialize $Q(s, a) = \frac{1}{1-\gamma}$ for all $(s, a)$, and $K = \emptyset$
    Initialize $\hat{T}$ and $\hat{R}$ (RMax initialization)
    $Q \leftarrow$ UpdateQ($\hat{\mathcal{M}}, \hat{T}, \hat{R}$)
    for $t \in [1, \text{max number of steps}]$ do
      $s$ = current state, $a = \arg\max_{a'} Q(s, a')$
      Observe reward $r$ and next state $s'$
      $n(s, a) \leftarrow n(s, a) + 1$
      if $n(s, a) < n_{\text{known}}$ then Store $(s, a, r, s')$
      if $n(s, a) = n_{\text{known}}$ then
        Update $K$ and $(\hat{T}^a_{ss'}, \hat{R}^a_s)$ (learned model)
        $Q \leftarrow$ UpdateQ($\hat{\mathcal{M}}, \hat{T}, \hat{R}$)
    Save $\hat{M} = \langle \hat{T}, \hat{R}, K, Q \rangle$ in $\hat{\mathcal{M}}$

  Function UpdateQ($\hat{\mathcal{M}}, \hat{T}, \hat{R}$):
    for $\bar{M} \in \hat{\mathcal{M}}$ do
      Compute $\hat{D}_{sa}(M \Vert \bar{M})$, $\hat{D}_{sa}(\bar{M} \Vert M)$ (Eq. 7)
      Compute $\hat{d}_{sa}(M \Vert \bar{M})$, $\hat{d}_{sa}(\bar{M} \Vert M)$ (DP on Eq. 8)
      Compute $\hat{U}_{\bar{M}}$ (Eq. 9)
    Compute $\hat{U}$ (Eq. 10)
    Compute and return $Q$ (DP on Eq. 6 using $\hat{U}$)

Proposition 5 shows that the sample complexity of LRMax is no worse than that of RMax. Consequently, in the worst case, LRMax performs as badly as learning from scratch, which is to say that the transfer method is not negative, as it cannot degrade the performance.

Proposition 6 (Computational complexity). The total computational complexity of LRMax (Algorithm 1) is
$$O\left( \tau + \frac{S^3 A^2 N}{1-\gamma} \ln \frac{1}{\epsilon_Q (1-\gamma)} \right),$$
with $\tau$ the number of interaction steps, $\epsilon_Q$ the precision of value iteration, and $N$ the number of source tasks.
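To connect Algorithm 1 with an implementation, here is a condensed sketch of its interaction loop (ours; `env`, `update_q`, and the memory format are hypothetical placeholders standing respectively for the task interface, the UpdateQ routine of Algorithm 1, and the stored $(\hat{T}, \hat{R}, K, Q)$ features).

```python
# Condensed sketch of the LRMax loop of Algorithm 1 (illustrative, not the authors' code).
from collections import defaultdict

def run_lrmax_on_task(env, n_actions, n_known, memory, update_q, max_steps):
    counts = defaultdict(int)                    # n(s, a)
    samples = defaultdict(list)                  # (s, a) -> [(r, s'), ...]
    known = set()                                # the set K of known pairs
    Q = update_q(memory, samples, known)         # Eqs. 6-10 happen inside update_q;
    s = env.reset()                              # Q is assumed to map (s, a) to a value
    for _ in range(max_steps):
        a = max(range(n_actions), key=lambda act: Q[(s, act)])  # greedy w.r.t. the bound
        r, s_next = env.step(a)
        counts[(s, a)] += 1
        if counts[(s, a)] < n_known:
            samples[(s, a)].append((r, s_next))  # keep collecting samples
        elif counts[(s, a)] == n_known:
            known.add((s, a))                    # pair becomes known: refit model, replan
            Q = update_q(memory, samples, known)
        s = s_next
    memory.append((samples, known, Q))           # saved features for transfer to later tasks
    return Q
```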
4.4 Refining the LRMax Bounds

LRMax relies on bounds on the local MDP dissimilarity (Equation 8). The quality of the Lipschitz bound on $Q^*_M$ can be improved according to the quality of those estimates. We discuss two methods to provide finer estimates.

Refining with prior knowledge. First, from the definition of $D_{sa}(M \Vert \bar{M})$, it is easy to show that this pseudometric between models is always upper bounded by $\frac{1+\gamma}{1-\gamma}$. However, in practice, the tasks experienced in a Lifelong RL experiment might not cover the full span of possible MDPs $\mathbb{M}$ and may systematically be closer to each other than $\frac{1+\gamma}{1-\gamma}$. For instance, the distance between two games in the Arcade Learning Environment (ALE) (Bellemare et al. 2013) is smaller than the maximum distance between any two MDPs defined on the common state-action space of the ALE (extended discussion in Appendix, Section 11). Let us denote by $\mathcal{M} \subseteq \mathbb{M}$ the set of possible MDPs for a particular Lifelong RL experiment. Let $D_{\max}(s, a) \triangleq \max_{(M, \bar{M}) \in \mathcal{M}^2} D_{sa}(M \Vert \bar{M})$ be the maximum model pseudo-distance at a particular $(s, a)$ pair on the subset $\mathcal{M}$. Prior knowledge might indicate a smaller upper bound for $D_{\max}(s, a)$ than $\frac{1+\gamma}{1-\gamma}$. We will denote such an upper bound by $D_{\max}$, considered valid for all $(s, a)$ pairs, i.e., such that
$$D_{\max} \ge \max_{(s, a, M, \bar{M}) \in S \times A \times \mathcal{M}^2} D_{sa}(M \Vert \bar{M}).$$
In a Lifelong RL experiment, $D_{\max}$ can be seen as a rough estimate of the maximum model discrepancy an agent may encounter. Figure 2 illustrates the relative importance of $D_{\max}$ vs. $\frac{1+\gamma}{1-\gamma}$.

[Figure 2: Illustration of the prior knowledge on the maximum pseudo-distance between models for a particular $(s, a)$ pair. Legend: set $\mathbb{M}$ of all the MDPs; set $\mathcal{M}$ of possible MDPs in a Lifelong RL experiment; sampled MDPs.]

Solving Equation 8 boils down to accumulating $\hat{D}_{sa}(M \Vert \bar{M})$ values in $\hat{d}_{sa}(M \Vert \bar{M})$. Hence, reducing a $\hat{D}_{sa}(M \Vert \bar{M})$ estimate at a single $(s, a)$ pair actually reduces $\hat{d}_{sa}(M \Vert \bar{M})$ at all $(s, a)$ pairs. Thus, replacing $\hat{D}_{sa}(M \Vert \bar{M})$ in Equation 8 by $\min\{D_{\max}, \hat{D}_{sa}(M \Vert \bar{M})\}$ provides a smaller upper bound $\hat{d}_{sa}(M \Vert \bar{M})$ on $d_{sa}(M \Vert \bar{M})$, and thus a smaller $\hat{U}$, which allows transfer if it is less than $\frac{1}{1-\gamma}$. Consequently, the knowledge of such a bound $D_{\max}$ can make the difference between successful and unsuccessful transfer, even if its value is of little importance. Conversely, setting a value for $D_{\max}$ quantifies the distance between MDPs for which transfer is efficient.

Refining by learning the maximum distance. The value of $D_{\max}(s, a)$ can be estimated online for each $(s, a)$ pair, discarding the hypothesis of available prior knowledge. We propose to use an empirical estimate of the maximum model distance at $(s, a)$: $\hat{D}_{\max}(s, a) \triangleq \max_{(M, \bar{M}) \in \hat{\mathcal{M}}^2} \hat{D}_{sa}(M \Vert \bar{M})$, with $\hat{\mathcal{M}}$ the set of explored tasks. The pitfall of this approach is that, with few explored tasks, $\hat{D}_{\max}(s, a)$ could underestimate $D_{\max}(s, a)$. Proposition 7 provides a lower bound on the probability that $\hat{D}_{\max}(s, a) + \epsilon$ does not underestimate $D_{\max}(s, a)$, depending on the number of sampled tasks.

Proposition 7. Consider an algorithm producing $\epsilon$-accurate model estimates $\hat{D}_{sa}(M \Vert \bar{M})$ for a subset $K$ of $S \times A$ after interacting with any two MDPs $M, \bar{M} \in \mathcal{M}$. Assume $\hat{D}_{sa}(M \Vert \bar{M}) \ge D_{sa}(M \Vert \bar{M})$ for any $(s, a) \notin K$. For all $(s, a) \in S \times A$ and $\delta \in (0, 1]$, after sampling $m$ tasks, if $m$ is large enough to verify $2(1 - p_{\min})^m - (1 - 2 p_{\min})^m \le \delta$, then
$$\Pr\left( \hat{D}_{\max}(s, a) + \epsilon \ge D_{\max}(s, a) \right) \ge 1 - \delta.$$

This result indicates when $\hat{D}_{\max}(s, a) + \epsilon$ upper bounds $D_{\max}(s, a)$ with high probability. In such a case, the $\hat{D}_{sa}(M \Vert \bar{M})$ of Equation 8 can be replaced by $\min\{\hat{D}_{\max}(s, a) + \epsilon, \hat{D}_{sa}(M \Vert \bar{M})\}$ to tighten the bound on $d_{sa}(M \Vert \bar{M})$. Assuming a lower bound $p_{\min}$ on the sampling probability of a task implies that $\mathcal{M}$ is finite and is seen as a non-adversarial task sampling rule (Abel et al. 2018).
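A small sketch (ours; function names are illustrative) of the two refinements: clipping $\hat{D}_{sa}$ with a prior or learned $D_{\max}$, and computing the number of sampled tasks required by the condition of Proposition 7.

```python
# Sketch of the Section 4.4 refinements (illustrative, not the authors' code).
import numpy as np

def clip_with_prior(D_hat, d_max_prior):
    """Use min{D_max, D_hat_sa} in Equation 8 instead of D_hat_sa."""
    return np.minimum(D_hat, d_max_prior)

def learned_d_max(D_hat_pairs, eps):
    """Empirical estimate D_hat_max(s, a) + eps over explored task pairs (Prop. 7)."""
    return np.maximum.reduce(D_hat_pairs) + eps  # elementwise max over (S, A) arrays

def tasks_needed(p_min, delta):
    """Smallest m such that 2 (1 - p_min)^m - (1 - 2 p_min)^m <= delta."""
    m = 1
    while 2 * (1 - p_min) ** m - (1 - 2 * p_min) ** m > delta:
        m += 1
    return m

print(tasks_needed(p_min=0.2, delta=0.05))       # -> 17 sampled tasks in this example
```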
5 Experiments

The experiments reported here (code available at https://github.com/SuReLI/llrl) illustrate how the Lipschitz bound (Equation 9) provides a tighter upper bound on $Q^*$, improving the sample complexity of LRMax compared to RMax and making the transfer of inter-task knowledge effective. Graphs are displayed with 95% confidence intervals. For information in line with the Machine Learning Reproducibility Checklist (Pineau 2019), see the Appendix, Section 16. We evaluate different variants of LRMax in a Lifelong RL experiment. The RMax algorithm is used as a no-transfer baseline. LRMax(x) denotes Algorithm 1 with prior $D_{\max} = x$. MaxQInit denotes the MAXQINIT algorithm from Abel et al. (2018), a state-of-the-art PAC-MDP algorithm. Both LRMax and MaxQInit achieve value transfer by providing a tighter upper bound on $Q^*$ than $\frac{1}{1-\gamma}$. Computing both upper bounds and taking the minimum results in combining the two approaches. We include such a combination in our study with the LRMaxQInit algorithm. Similarly, the latter algorithm benefiting from prior knowledge $D_{\max} = x$ is denoted by LRMaxQInit(x). For the sake of comparison, we only compare algorithms with the same features, namely tabular, online, PAC-MDP methods presenting non-negative transfer.

The environment used in all experiments is a variant of the tight task used by Abel et al. (2018). It is an 11×11 gridworld; the initial state is in the center and the actions are the cardinal moves (Appendix, Section 12). The reward is always zero except for the three goal cells in the upper-right corner. Each sampled task has its own reward values, drawn from [0.8, 1] for each of the three goal cells, and its own probability of slipping (performing a different action than the one selected), picked in [0, 0.1]. Hence, tasks have different reward and transition functions. Notice the distinction in applicability between MaxQInit, which requires the set of MDPs to be finite, and LRMax, which can be used with any set of MDPs. For the comparison between both to be possible, we drew tasks from a finite set of 5 MDPs. We sample 15 tasks sequentially among this set, each run for 2000 episodes of length 10. The operation is repeated 10 times to narrow the confidence intervals. We set $n_{\text{known}} = 10$, $\delta = 0.05$, and $\epsilon = 0.01$ (discussion in Appendix, Section 15). Other Lifelong RL experiments are reported in Appendix, Section 13.
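As a reading aid, here is a sketch (ours, inferred from the description above rather than taken from the released code) of how such a finite task set and lifelong task sequence could be generated.

```python
# Sketch of the task-sampling setup described above (illustrative, not the authors' code).
import random

def sample_task(rng):
    """One gridworld task: three goal rewards in [0.8, 1] and a slip probability in [0, 0.1]."""
    return {
        "goal_rewards": [rng.uniform(0.8, 1.0) for _ in range(3)],
        "slip_prob": rng.uniform(0.0, 0.1),
    }

rng = random.Random(0)
task_set = [sample_task(rng) for _ in range(5)]                 # finite set of 5 MDPs
lifelong_sequence = [rng.choice(task_set) for _ in range(15)]   # 15 tasks drawn sequentially
```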
The results are reported in Figure 3. Figure 3a displays the discounted return for each task, averaged across episodes. Similarly, Figure 3b displays the discounted return for each episode, averaged across tasks (same color code as Figure 3a). Figure 3c displays the discounted return for five specific instances, detailed below. To avoid inter-task disparities, all the aforementioned discounted returns are displayed relative to an estimator of the optimal expected return for each task. For readability, Figures 3b and 3c display a moving average over 100 episodes. Figure 3d reports the benefits of various values of $D_{\max}$ on the algorithmic properties.

[Figure 3: Experimental results. (a) Average discounted return vs. tasks; (b) average discounted return vs. episodes; (c) discounted return for specific tasks; (d) algorithmic properties vs. the prior knowledge $D_{\max}$. LRMax benefits from an enhanced sample complexity thanks to the value-transfer method.]

In Figure 3a, we first observe that LRMax benefits from the transfer method, as the average discounted return increases as more tasks are experienced. Moreover, this advantage appears as early as the second task. In contrast, MaxQInit requires waiting until task 12 before benefiting from transfer. As suggested in Section 4.4, increasing amounts of prior knowledge allow the LRMax transfer method to be more efficient: a smaller known upper bound $D_{\max}$ on $D_{sa}(M \Vert \bar{M})$ accelerates convergence. Combining both approaches in the LRMaxQInit algorithm outperforms all other methods.

Episode-wise, we observe in Figure 3b that the LRMax transfer method allows for faster convergence, i.e., lower sample complexity. Interestingly, LRMax exhibits three stages in the learning process. 1) The first episodes are characterized by a direct exploitation of the transferred knowledge, causing these episodes to yield high payoff. This behavior is a consequence of the combined facts that the Lipschitz bound (Equation 9) is larger on promising regions of $S \times A$ seen on previous tasks and that LRMax acts greedily w.r.t. that bound. 2) This high-performance regime is followed by the exploration of unknown regions of $S \times A$, in our case yielding low returns. Indeed, as promising regions are explored first, the bound becomes tighter for the corresponding state-action pairs, enough for the Lipschitz bound of unknown pairs to become larger, thus driving the exploration towards low-payoff regions. Such regions are then identified and never revisited. 3) Eventually, LRMax stops exploring and converges to the optimal policy. Importantly, in all experiments, LRMax never experiences negative transfer, as supported by the provability of the Lipschitz upper bound with high probability. LRMax is at least as efficient as the no-transfer RMax baseline.

Figure 3c displays the collected returns of RMax, LRMax(0.1), and MaxQInit for specific tasks. We observe that LRMax benefits from transfer as early as Task 2, where the previous three-stage behavior is visible. MaxQInit takes until task 12 to leverage the transfer method. However, the bound it provides is tight enough that it does not have to explore.

In Figure 3d, we display the following quantities for various values of $D_{\max}$: $\rho_{\text{Lip}}$, the fraction of the time the Lipschitz bound was tighter than the RMax bound $\frac{1}{1-\gamma}$; $\rho_{\text{Speed-up}}$, the relative gain in time steps before convergence when comparing LRMax to RMax (this quantity is estimated based on the last updates of the empirical model $\hat{M}$); and $\rho_{\text{Return}}$, the relative total return gain over 2000 episodes of LRMax w.r.t. RMax. First, we observe an increase of $\rho_{\text{Lip}}$ as $D_{\max}$ becomes tighter. This means that the Lipschitz bound of Equation 9 becomes effectively smaller than $\frac{1}{1-\gamma}$. This phenomenon leads to faster convergence, indicated by $\rho_{\text{Speed-up}}$. Eventually, this increased convergence rate allows for a net total return gain, as can be seen with the increase of $\rho_{\text{Return}}$.

Overall, in this analysis, we have shown that LRMax benefits from an enhanced sample complexity thanks to the value-transfer method. The knowledge of a prior $D_{\max}$ increases this benefit. The method is comparable to the MaxQInit method and has some advantages, such as being usable early in the lifelong experiment and being applicable to infinite sets of tasks. Moreover, the transfer is non-negative while preserving the PAC-MDP guarantees of the algorithm. Additionally, we show in Appendix, Section 14, that, when provided with any prior $D_{\max}$, LRMax increasingly stops using it during exploration, confirming the claim of Section 4.4 that providing $D_{\max}$ enables transfer even if its value is of little importance.

6 Conclusion

We have studied theoretically the Lipschitz continuity property of the optimal Q-function in the MDP space w.r.t. a new metric.
We proved a local Lipschitz continuity result, establishing that the optimal Q-functions of two close MDPs are themselves close to each other. We then proposed a value-transfer method using this continuity property with the Lipschitz RMax algorithm, practically implementing this approach in the Lifelong RL setting. The algorithm preserves PAC-MDP guarantees, accelerates learning in subsequent tasks, and exhibits no negative transfer. Improvements of the algorithm were discussed in the form of prior knowledge on the maximum distance between models and online estimation of this distance. As a non-negative, similarity-based, PAC-MDP transfer method, the LRMax algorithm is the first method of the literature combining these three appealing features. We showcased the algorithm in Lifelong RL experiments and demonstrated empirically its ability to accelerate learning while not experiencing any negative transfer. Notably, our approach can directly extend other PAC-MDP algorithms (Szita and Szepesvári 2010; Rao and Whiteson 2012; Pazis, Parr, and How 2016; Dann, Lattimore, and Brunskill 2017) to the Lifelong setting. In hindsight, we believe this contribution provides a sound basis for non-negative value transfer via MDP similarity, a study that was lacking in the literature. Key insights for the practitioner lie both in the theoretical analysis and in the practical derivation of a transfer scheme achieving non-negative transfer with PAC guarantees. Further, designing scalable methods conveying the same intuition could be a promising research direction.

Acknowledgments

We would like to thank Dennis Wilson for fruitful discussions and comments on the paper. This research was supported by the Occitanie region, France; ISAE-SUPAERO (Institut Supérieur de l'Aéronautique et de l'Espace); Fondation ISAE-SUPAERO; École Doctorale Systèmes; and ONERA, the French Aerospace Lab.

References

Abel, D.; Jinnai, Y.; Guo, S. Y.; Konidaris, G.; and Littman, M. L. 2018. Policy and Value Transfer in Lifelong Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 20–29.
Ammar, H. B.; Eaton, E.; Taylor, M. E.; Mocanu, D. C.; Driessens, K.; Weiss, G.; and Tuyls, K. 2014. An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning. In Workshops at the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).
Asadi, K.; Misra, D.; and Littman, M. L. 2018. Lipschitz Continuity in Model-Based Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018).
Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: 253–279.
Bellman, R. 1957. Dynamic Programming. Princeton, USA: Princeton University Press.
Brafman, R. I.; and Tennenholtz, M. 2002. R-max - a General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3(Oct): 213–231.
Brunskill, E.; and Li, L. 2013. Sample Complexity of Multi-task Reinforcement Learning. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013).
Brunskill, E.; and Li, L. 2014. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 316–324.
Carroll, J. L.; and Seppi, K. 2005. Task Similarity Measures for Transfer in Reinforcement Learning Task Libraries.
In Proceedings of the 5th International Joint Conference on Neural Networks (IJCNN 2005), volume 2, 803–808. IEEE.
Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 5713–5723.
Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for Finite Markov Decision Processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), 162–169. AUAI Press.
Lazaric, A. 2012. Transfer in Reinforcement Learning: a Framework and a Survey. In Reinforcement Learning, 143–173. Springer.
Lazaric, A.; Restelli, M.; and Bonarini, A. 2008. Transfer of Samples in Batch Reinforcement Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 544–551.
Mahmud, M. M.; Hawasly, M.; Rosman, B.; and Ramamoorthy, S. 2013. Clustering Markov Decision Processes for Continual Transfer. Computing Research Repository (arXiv/CoRR). URL https://arxiv.org/abs/1311.3959.
Pazis, J.; Parr, R. E.; and How, J. P. 2016. Improving PAC Exploration using the Median of Means. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 3898–3906.
Pineau, J. 2019. Machine Learning Reproducibility Checklist. https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf. Version 1.2, March 27, 2019, last accessed on August 27, 2020.
Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy Gradient in Lipschitz Markov Decision Processes. Machine Learning 100(2-3): 255–283.
Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Rachelson, E.; and Lagoudakis, M. G. 2010. On the Locality of Action Domination in Sequential Decision Making. In Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010).
Rao, K.; and Whiteson, S. 2012. V-MAX: Tempered Optimism for Better PAC Reinforcement Learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), 375–382.
Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, 05.
Song, J.; Gao, Y.; Wang, H.; and An, B. 2016. Measuring the Distance Between Finite Markov Decision Processes. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), 468–476.
Sorg, J.; and Singh, S. 2009. Transfer via Soft Homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), 741–748. International Foundation for Autonomous Agents and Multiagent Systems.
Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10(Nov): 2413–2444.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press, Cambridge.
Szita, I.; and Szepesvári, C. 2010. Model-Based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 1031–1038.
Taylor, M. E.; and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(Jul): 1633–1685.
Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
Wilson, A.; Fern, A.; Ray, S.; and Tadepalli, P. 2007. Multi-Task Reinforcement Learning: A Hierarchical Bayesian Approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 1015–1022.