# Reinforcement Learning with Trajectory Feedback

Yonathan Efroni (1,2), Nadav Merlis (1) and Shie Mannor (1,3) (equal contribution, alphabetical order)
(1) Technion, Israel Institute of Technology; (2) Microsoft Research, New York; (3) Nvidia Research, Israel

The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, it is often the case that such frequent feedback is not available. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm.

1 Introduction

The field of Reinforcement Learning (RL) tackles the problem of learning how to act optimally in an unknown dynamical environment. Recently, RL witnessed remarkable empirical success (e.g., Mnih et al. 2015; Levine et al. 2016; Silver et al. 2017). However, there are still some matters that hinder its use in practice. One of them, we claim, is the type of feedback an RL agent is assumed to observe. Specifically, in the standard RL formulation, an agent acts in an unknown environment and receives feedback on its actions in the form of a state-action dependent reward signal. Although such an interaction model seems undemanding at first sight, in many interesting problems, such reward feedback cannot be realized. In practice, and specifically in non-simulated environments, it is hardly ever the case that an agent can query a state-action reward function at every visited state-action pair, since such a query can be very costly. For example, consider the following problems:

(i) Consider the challenge of autonomous car driving. Were we to deploy an RL algorithm in this setting, we would need a reward signal from every visited state-action pair. Obtaining such data is expected to be very costly, since it requires scoring each state-action pair with a real number. For example, if a human is involved in providing the feedback, (a) he or she might refuse to supply such feedback due to the Sisyphean nature of this task, or (b) supplying such feedback might take too much time for the needs of the algorithm designer.

(ii) Consider a multi-stage UX interface that we want to optimize using an RL algorithm. To do so, in the standard RL setting, we would need a score for every visited state-action pair. However, as we ask the users for more information on the quality of different state-action pairs, the users' opinions might change due to the information they need to supply. For example, as we ask for more information, the user might be prone to be more negative about the quality of the interface, as the user becomes less patient to provide the requested feedback.
Thus, we would like to keep the number of queries from the user to a minimum. Rather than circumventing this problem by deploying heuristics (e.g., by hand-engineering a reward signal), in this work, we relax the feedback mechanism to a weaker and more practical one. We then study RL algorithms in the presence of this weaker form of feedback, a setting which we refer to as RL with trajectory feedback. In RL with trajectory feedback, the agent does not have access to a per state-action reward function. Instead, it receives the sum of rewards on the performed trajectory, as well as the identity of the visited state-action pairs in the trajectory. E.g., for autonomous car driving, we only require feedback on the score of a trajectory, instead of the score of each individual state-action pair. Indeed, this form of feedback is much weaker than the standard RL feedback and is expected to be more common in practical scenarios.

We start by defining our setting and specifying the interaction model of RL with trajectory feedback (Section 2). In Section 3, we introduce a natural least-squares estimator with which the true reward function can be learned based on the trajectory feedback. Building on the least-squares estimator, we study algorithms that explicitly trade off exploration and exploitation. We start by considering the case where the model is known while the reward function needs to be learned. By generalizing the analysis of standard linear bandit algorithms (OFUL (Abbasi-Yadkori, Pál, and Szepesvári 2011) and Thompson Sampling (TS) for linear bandits (Agrawal and Goyal 2013)), we establish performance guarantees for this setup in Sections 4 and 5.1. Although the OFUL-based algorithm gives better performance than the TS-based algorithm, its update rule is computationally intractable, as it requires solving a convex maximization problem. Thus, in Section 5.2 we generalize the TS-based algorithm to the case where both the reward and the transition model are unknown. To this end, we learn the reward by a TS approach and learn the transition model by an optimistic approach. The combination of the two approaches yields a computationally tractable algorithm, which requires solving an empirical Markov Decision Process (MDP) in each round. For all algorithms, we establish regret guarantees that scale as $\sqrt{K}$, where $K$ is the number of episodes. Finally, in Section 5.3, we identify the most computationally demanding stage in the algorithm and suggest a variant of the algorithm that rarely performs this stage. Notably, we show that the effect of this modification on the regret is minor. A summary of our results can be found in Table 1.

2 Notations and Definitions

We consider finite-horizon MDPs with time-independent dynamics. A finite-horizon MDP is defined by the tuple $M = (S, A, R, P, H)$, where $S$ and $A$ are the state and action spaces with cardinalities $S$ and $A$, respectively. The immediate reward for taking an action $a$ at state $s$ is a random variable $R(s, a) \in [0, 1]$ with expectation $\mathbb{E}[R(s, a)] = r(s, a)$. The transition kernel is $P(s' \mid s, a)$, the probability of transitioning to state $s'$ upon taking action $a$ at state $s$. $H \in \mathbb{N}$ is the horizon, i.e., the number of time-steps in each episode, and $K \in \mathbb{N}$ is the total number of episodes. We define $[N] \overset{\mathrm{def}}{=} \{1, \ldots, N\}$ for all $N \in \mathbb{N}$, and use $h \in [H]$ and $k \in [K]$ to denote a time-step inside an episode and the index of an episode, respectively.
We also denote the initial state in episode $k \in [K]$ by $s_1^k$, which can be arbitrarily chosen. A deterministic policy $\pi : S \times [H] \to A$ is a mapping from states and time-step indices to actions. We denote by $a_h \overset{\mathrm{def}}{=} \pi(s_h, h)$ the action taken at time $h$ at state $s_h$ according to a policy $\pi$. The quality of a policy $\pi$ from state $s$ at time $h$ is measured by its value function, which is defined as
$$V_h^\pi(s) \overset{\mathrm{def}}{=} \mathbb{E}\Big[ \sum_{h'=h}^{H} r\big(s_{h'}, \pi(s_{h'}, h')\big) \mid s_h = s, \pi \Big],$$
where the expectation is over the environment randomness. An optimal policy maximizes this value for all states $s$ and time-steps $h$ simultaneously, and the corresponding optimal value is denoted by $V_h^*(s) \overset{\mathrm{def}}{=} \max_\pi V_h^\pi(s)$, for all $h \in [H]$.

We can also reformulate the optimization problem using the occupancy measure (e.g., Puterman 1994; Altman 1999). The occupancy measure of a policy $\pi$ is defined as the distribution over state-action pairs generated by executing the policy $\pi$ in the finite-horizon MDP $M$ with a transition kernel $p$ (e.g., Zimin and Neu 2013):
$$q_h^\pi(s, a; p) \overset{\mathrm{def}}{=} \mathbb{E}\big[\mathbb{1}(s_h = s, a_h = a) \mid s_1, p, \pi\big] = \Pr\{s_h = s, a_h = a \mid s_1, p, \pi\}.$$
For brevity, we define the matrix notation $q^\pi(p) \in \mathbb{R}^{HSA}$, where its $(s, a, h)$ element is given by $q_h^\pi(s, a; p)$. Furthermore, let the average occupancy measure be $d^\pi(p) \in \mathbb{R}^{SA}$ such that $d^\pi(s, a; p) \overset{\mathrm{def}}{=} \sum_{h=1}^{H} q_h^\pi(s, a; p)$. For ease of notation, when working with the transition kernel of the true model $p = P$, we write $q^\pi = q^\pi(P)$ and $d^\pi = d^\pi(P)$. This definition implies the following relation:
$$V_1^\pi(s_1; p, r) = \sum_{h=1}^{H} \sum_{s_h, a_h} r(s_h, a_h)\, q_h^\pi(s_h, a_h; p) = \sum_{s,a} d^\pi(s, a; p)\, r(s, a) = d^\pi(p)^T r, \qquad (1)$$
where $V_1^\pi(s_1; p, r)$ is the value of an MDP whose reward function is $r$ and whose transition kernel is $p$.

Interaction Model of Reinforcement Learning with Trajectory Feedback. We now define the interaction model of RL agents that receive trajectory feedback, the model that we analyze in this work. We consider an agent that repeatedly interacts with an MDP in a sequence of episodes $[K]$. The performance of the agent is measured by its regret, defined as $R(K) \overset{\mathrm{def}}{=} \sum_{k=1}^{K} \big( V_1^*(s_1^k) - V_1^{\pi_k}(s_1^k) \big)$. We denote by $s_h^k$ and $a_h^k$ the state and the action taken at the $h$th time-step of the $k$th episode. At the end of each episode $k \in [K]$, the agent only observes the cumulative reward experienced while following its policy $\pi_k$ and the identity of the visited state-action pairs, i.e.,
$$\hat{V}_k(s_1^k) = \sum_{h=1}^{H} R(s_h^k, a_h^k), \quad \text{and} \quad \{(s_h^k, a_h^k)\}_{h=1}^{H}. \qquad (2)$$
This comes in contrast to the standard RL setting, in which the agent observes the reward per visited state-action pair, $\{R(s_h^k, a_h^k)\}_{h=1}^{H}$. Thus, RL with trajectory feedback receives more obscured feedback from the environment on the quality of its actions. Obviously, standard RL feedback allows calculating $\hat{V}_k(s_1^k)$, but one cannot generally reconstruct $\{R(s_h^k, a_h^k)\}_{h=1}^{H}$ by accessing only $\hat{V}_k(s_1^k)$.

Next, we define the filtration $\mathcal{F}_k$ that includes all events (states, actions, and rewards) until the end of the $k$th episode, as well as the initial state of episode $k+1$. We denote by $T = KH$ the total number of time-steps (samples). Moreover, we denote by $n_k(s, a)$ the number of times the agent has visited a state-action pair $(s, a)$, and by $\hat{X}_k$ the empirical average of a random variable $X$. Both quantities are based on experience gathered until the end of the $k$th episode and are $\mathcal{F}_k$-measurable.
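To make the interaction protocol concrete, the following minimal Python sketch (our illustration, not code from the paper) implements an episodic tabular MDP that reveals only trajectory feedback: at the end of an episode the agent sees the visited state-action pairs and the sum of the realized rewards, but never the per-step rewards. The class and function names (`TabularMDP`, `rollout`) and the Bernoulli reward model are illustrative assumptions.

```python
import numpy as np

class TabularMDP:
    """Finite-horizon tabular MDP that reveals only trajectory feedback (sketch)."""

    def __init__(self, P, r, H, seed=0):
        # P[s, a, s']: transition kernel; r[s, a]: mean reward in [0, 1]; H: horizon.
        self.P, self.r, self.H = P, r, H
        self.S, self.A = r.shape
        self.rng = np.random.default_rng(seed)

    def rollout(self, policy, s1=0):
        """Run one episode with a deterministic policy[s, h]; return the visited
        (s, a) pairs and V_hat = sum of realized rewards (per-step rewards hidden)."""
        s, visited, v_hat = s1, [], 0.0
        for h in range(self.H):
            a = policy[s, h]
            visited.append((s, a))
            v_hat += self.rng.binomial(1, self.r[s, a])   # only the sum is ever revealed
            s = self.rng.choice(self.S, p=self.P[s, a])
        return visited, v_hat

# Tiny usage example: 2 states, 2 actions, horizon 3, a fixed policy.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(2), size=(2, 2))        # shape (S, A, S), rows sum to 1
r = np.array([[0.1, 0.9], [0.5, 0.2]])
mdp = TabularMDP(P, r, H=3)
policy = np.zeros((2, 3), dtype=int)              # always play action 0
visited, v_hat = mdp.rollout(policy)
print(visited, v_hat)
```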
Notations. We use $\tilde{O}(X)$ to refer to a quantity that is upper bounded by $X$ up to poly-log factors of $S, A, T, K, H$, and $\frac{1}{\delta}$. Furthermore, the notation $O(X)$ refers to a quantity that is upper bounded by $X$ up to constant multiplicative factors. We use $X \vee Y \overset{\mathrm{def}}{=} \max\{X, Y\}$ and denote by $I_m$ the identity matrix in dimension $m$. Similarly, we denote by $0_m \in \mathbb{R}^m$ the vector whose components are zeros. Finally, for any positive definite matrix $M \in \mathbb{R}^{m \times m}$ and any vector $x \in \mathbb{R}^m$, we define $\|x\|_M = \sqrt{x^T M x}$.

| Result | Exploration | Model Learning | Time Complexity | Regret |
| --- | --- | --- | --- | --- |
| Theorem 3 | OFUL | ✗ | Computationally hard | $\tilde{O}\big(SAH\sqrt{K}\big)$ |
| Theorem 4 | TS | ✗ | $O\big((SA)^3 + S^2A(H + A)\big)$ | $\tilde{O}\big((SA)^{3/2}H\sqrt{K}\big)$ |
| Theorem 5 | TS | ✓ | $O\big((SA)^3 + S^2A(H + A)\big)$ | $\tilde{O}\big(S^2A^{3/2}H^{3/2}\sqrt{K}\big)$ |
| Theorem 7 | TS | ✓ | $O\big(S^2A(H + A) + \frac{(SA)^4}{K\log(1+C)}\log(KH)\big)$ | $\tilde{O}\big(S^2A^{3/2}H^{3/2}\sqrt{K}\big)$ (for moderate $C$) |

Table 1: $S$ and $A$ are the state and action space sizes, respectively, and $H$ is the horizon. $K$ is the number of episodes and $C > 0$ is a parameter of Algorithm 4. Exploration: whether the reward exploration is optimistic (OFUL) or uses posterior sampling (TS). Model Learning: whether the algorithm knows the model (✗) or has to learn it (✓). Time complexity: per-episode average time complexity. The hardness of the optimistic algorithm is explained at the end of Section 4, and the time complexity of the TS-based algorithms is explained in Section 5.3. Regret bounds ignore log-factors and constants and assume that $SA \geq H$.

3 From Trajectory Feedback to Least-Squares Estimation

In this section, we examine an intuitive way of estimating the true reward function $r$, given only the cumulative rewards on each of the past trajectories and the identities of the visited state-action pairs. Specifically, we estimate $r$ via Least Squares (LS) estimation. Consider past data in the form of (2). To make the connection between trajectory feedback and LS estimation more apparent, let us rewrite (2) as follows:
$$\hat{V}_k(s_1^k) = \hat{q}_k^T R, \qquad (3)$$
where $\hat{q}_k \in \mathbb{R}^{SAH}$ is the empirical state-action visitation vector given by $\hat{q}_k(s, a, h) = \mathbb{1}\{s = s_h^k, a = a_h^k\} \in [0, 1]$, and $R \in \mathbb{R}^{SAH}$ is the noisy version of the true reward function, namely $R(s, a, h) = R(s, a)$. Indeed, since the identity of the visited state-action pairs is given to us, we can compute $\hat{q}_k$ from our data. Furthermore, observe that
$$\mathbb{E}\big[\hat{q}_k^T R \mid \hat{q}_k\big] = \sum_{s,a,h} \hat{q}_k(s, a, h)\, r(s, a) = \sum_{s,a} \Big(\sum_{h=1}^{H} \hat{q}_k(s, a, h)\Big) r(s, a) \overset{\mathrm{def}}{=} \hat{d}_k^T r,$$
where the first equality holds since we assume the rewards are i.i.d. and drawn at the beginning of each episode. In the last equality we defined the empirical state-action frequency vector $\hat{d}_k \in \mathbb{R}^{SA}$, where $\hat{d}_k(s, a) = \sum_{h=1}^{H} \hat{q}_k(s, a, h)$, or, alternatively, $\hat{d}_k(s, a) = \sum_{h=1}^{H} \mathbb{1}\{s = s_h^k, a = a_h^k\} \in [0, H]$.

This observation enables us to think of our data as noisy samples of $\hat{d}_k^T r$, from which it is natural to estimate the reward $r$ by a (regularized) LS estimator, i.e., for some $\lambda > 0$,
$$\hat{r}_k \in \arg\min_{r} \sum_{l=1}^{k} \big( \langle \hat{d}_l, r \rangle - \hat{V}_l \big)^2 + \lambda \|r\|_2^2.$$
This estimator is also given by the closed-form solution
$$\hat{r}_k = \big(D_k^T D_k + \lambda I_{SA}\big)^{-1} Y_k \overset{\mathrm{def}}{=} A_k^{-1} Y_k, \qquad (4)$$
where $D_k \in \mathbb{R}^{k \times SA}$ is the matrix whose rows are $\{\hat{d}_l^T\}_{l=1}^{k}$, $Y_k = \sum_{l=1}^{k} \hat{d}_l \hat{V}_l \in \mathbb{R}^{SA}$ and $A_k = D_k^T D_k + \lambda I_{SA} \in \mathbb{R}^{SA \times SA}$.

A needed property of the estimator $\hat{r}_k$ is that it concentrates around the true reward $r$. By properly defining the filtration and observing that $\hat{V}_k$ is $\sqrt{H/4}$ sub-Gaussian given $\hat{d}_k$ (as a sum of $H$ independent variables in $[0, 1]$), it is easy to establish a uniform concentration bound via Theorem 2 of (Abbasi-Yadkori, Pál, and Szepesvári 2011) (for completeness, we provide the proof in Appendix I).

Proposition 1 (Concentration of Reward). Let $\lambda > 0$ and $A_k \overset{\mathrm{def}}{=} D_k^T D_k + \lambda I_{SA}$. For any $\delta \in (0, 1)$, with probability greater than $1 - \delta/10$, uniformly for all $k \geq 0$, it holds that
$$\|\hat{r}_k - r\|_{A_k} \leq \sqrt{\tfrac{1}{4} SAH \log\tfrac{1 + kH^2/\lambda}{\delta}} + \sqrt{\lambda SA} \overset{\mathrm{def}}{=} l_k.$$
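The estimator in (4) is ridge regression on episode-level data and can be maintained incrementally. Below is a short sketch (ours, not the authors' implementation) that accumulates $A_k$ and $Y_k$ from the observed trajectories and recovers $\hat{r}_k$; the helper `empirical_frequency`, which builds $\hat{d}_k$ from the visited pairs, is a hypothetical name.

```python
import numpy as np

def empirical_frequency(visited, S, A):
    """d_hat in R^{SA}: number of visits to each (s, a) during the episode (in [0, H])."""
    d = np.zeros(S * A)
    for s, a in visited:
        d[s * A + a] += 1.0
    return d

class TrajectoryLS:
    """Regularized least-squares reward estimate from trajectory feedback, cf. Eq. (4)."""

    def __init__(self, S, A, lam):
        self.A_mat = lam * np.eye(S * A)   # A_k = D_k^T D_k + lambda * I
        self.Y = np.zeros(S * A)           # Y_k = sum_l d_hat_l * V_hat_l
        self.S, self.A = S, A

    def update(self, visited, v_hat):
        d = empirical_frequency(visited, self.S, self.A)
        self.A_mat += np.outer(d, d)
        self.Y += d * v_hat

    def estimate(self):
        # r_hat_k = A_k^{-1} Y_k ; solving the linear system avoids an explicit inverse.
        return np.linalg.solve(self.A_mat, self.Y)

# Usage sketch: after each episode k, feed the visited pairs and the trajectory return.
est = TrajectoryLS(S=2, A=2, lam=3.0)                  # e.g., lambda = H
est.update([(0, 1), (1, 0), (0, 1)], v_hat=2.0)
print(est.estimate().reshape(2, 2))                    # estimated r(s, a)
```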
Relation to Linear Bandits. Assume that the transition kernel $P$ is known while the reward $r$ is unknown. Also, recall that the set of average occupancy measures, denoted by $\mathcal{K}(P)$, is a convex set (Altman 1999). Then, Equation (1) establishes that RL with trajectory feedback can be understood as an instance of linear bandits over a convex set. That is, it is equivalent to the problem of minimizing the regret $R(K) = \sum_k \big( \max_{d \in \mathcal{K}(P)} d^T r - d_{\pi_k}^T r \big)$, where the feedback is a noisy version of $d_{\pi_k}^T r$, since $\mathbb{E}\big[\hat{V}_k \mid \mathcal{F}_{k-1}\big] = d_{\pi_k}^T r$. Under this formulation, choosing a policy $\pi_k$ is equivalent to choosing an action $d_{\pi_k} \in \mathcal{K}(P)$ from the convex set. However, we make use of $\hat{d}_k$, and not the actual action that was taken, $d_{\pi_k}$. Importantly, this view allows us to generalize algorithms to the case where the transition model $P$ is unknown (as we do in Section 5). When the transition model is not known and is estimated via $\bar{P}$, there is an error in identifying the action, in the view of linear bandits, since $d_\pi \neq d_\pi(\bar{P})$. This error, had we used the naive linear bandit approach of choosing contexts from $\mathcal{K}(\bar{P})$, would result in errors in the matrix $A_k$. Since our estimator uses the empirical average state-action frequency $\hat{d}_k$, the fact that the model is unknown does not distort the reward estimation.

Algorithm 1: OFUL for RL with Trajectory Feedback and Known Model
Require: $\delta \in (0, 1)$, $\lambda = H$, $l_k = \sqrt{\tfrac{1}{4} SAH \log\tfrac{1 + kH^2/\lambda}{\delta}} + \sqrt{\lambda SA}$
Initialize: $A_0 = \lambda I_{SA}$, $Y_0 = 0_{SA}$
for $k = 1, \ldots, K$ do
  Calculate $\hat{r}_{k-1}$ via LS estimation (4)
  Solve $\pi_k \in \arg\max_\pi d_\pi^T \hat{r}_{k-1} + l_{k-1} \|d_\pi\|_{A_{k-1}^{-1}}$
  Play $\pi_k$, observe $\hat{V}_k$ and $\{(s_h^k, a_h^k)\}_{h=1}^{H}$
  Update $A_k = A_{k-1} + \hat{d}_k \hat{d}_k^T$ and $Y_k = Y_{k-1} + \hat{d}_k \hat{V}_k$
end for

Policy Gradient Approach for RL with Trajectory Feedback. Another natural algorithmic approach to the setting of RL with trajectory feedback is policy search. That is, instead of estimating the reward function via least squares and following a model-based approach, one can directly optimize over the policy class. By the log-derivative trick (as in the REINFORCE algorithm (Williams 1992)),
$$\nabla_\pi V^\pi(s) = \mathbb{E}\left[ \Big( \sum_{h=1}^{H} \nabla_\pi \log \pi(a_h \mid s_h) \Big) \Big( \sum_{h=1}^{H} r(s_h, a_h) \Big) \right].$$
Thus, if we are supplied with the cumulative reward of a trajectory, we can estimate the derivative $\nabla_\pi V^\pi(s)$ and use a stochastic gradient ascent algorithm (see the sketch below). However, this approach fails in cases where the exploration is challenging (Agarwal et al. 2020), i.e., the sample complexity can increase exponentially with $H$. We conjecture that by combining an exploration bonus with the REINFORCE algorithm, it can provably perform well, with polynomial sample complexity. We leave such an extension as an interesting future research direction.
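As a concrete instance of the policy-search route, here is a minimal sketch (ours, assuming a tabular softmax policy, which the paper does not specify) of the REINFORCE gradient estimate computed from a single trajectory and its return; note that it contains no exploration bonus, so it inherits the exploration issue discussed above.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectory, v_hat):
    """Score-function gradient estimate from one trajectory under trajectory feedback.

    theta      : array of shape (H, S, A), logits of a softmax policy pi_h(a|s).
    trajectory : list of (h, s, a) triples visited during the episode.
    v_hat      : observed sum of rewards over the trajectory (the only reward signal).
    Returns an estimate of grad_theta V^pi, i.e., of
    E[ (sum_h grad_theta log pi(a_h|s_h)) * (sum_h r(s_h, a_h)) ].
    """
    grad = np.zeros_like(theta)
    for h, s, a in trajectory:
        probs = softmax(theta[h, s])
        # gradient of log pi(a|s) for a softmax policy: e_a - pi(.|s)
        grad[h, s] -= probs
        grad[h, s, a] += 1.0
    return grad * v_hat

# Usage sketch: one stochastic-gradient-ascent step with learning rate eta.
H, S, A, eta = 3, 2, 2, 0.1
theta = np.zeros((H, S, A))
trajectory = [(0, 0, 1), (1, 1, 0), (2, 0, 1)]   # (h, s, a) triples from one episode
theta += eta * reinforce_gradient(theta, trajectory, v_hat=2.0)
```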
4 OFUL for RL with Trajectory Feedback and Known Model

Given the concentration of the estimated reward in Proposition 1, it is natural to follow the optimism-in-the-face-of-uncertainty approach, as used in the OFUL algorithm (Abbasi-Yadkori, Pál, and Szepesvári 2011) for linear bandits. We adapt this approach to RL with trajectory feedback, as depicted in Algorithm 1; in each episode, we find a policy that maximizes the estimated value $V^\pi(s_1; P, \hat{r}_{k-1}) = d_\pi^T \hat{r}_{k-1}$ plus an additional confidence term $l_{k-1} \|d_\pi\|_{A_{k-1}^{-1}}$ that properly encourages the policy $\pi_k$ to be exploratory.

The analysis of OFUL is based upon two key ingredients: (i) a concentration result, and (ii) an elliptical potential lemma. For the setting of RL with trajectory feedback, the concentration result can be established with tools similar to those of Abbasi-Yadkori, Pál, and Szepesvári (see Proposition 1). However, (ii), the elliptical potential lemma, needs to be re-derived. The usual elliptical potential lemma (Abbasi-Yadkori, Pál, and Szepesvári 2011) states that for $x_k \in \mathbb{R}^m$ with $\|x_k\|_2 \leq x$,
$$\sum_{k=1}^{K} \|x_k\|_{A_{k-1}^{-1}} \leq O\Big( \big(1 \vee \tfrac{x}{\sqrt{\lambda}}\big) \sqrt{K m \log\tfrac{\lambda + K x^2}{\lambda}} \Big),$$
where $A_k = A_{k-1} + x_k x_k^T$ and $A_0 = \lambda I_m$. However, for RL with trajectory feedback, the term we wish to bound is $\sum_{k=1}^{K} \|d_{\pi_k}\|_{A_{k-1}^{-1}}$, where $A_k = A_{k-1} + \hat{d}_k \hat{d}_k^T$ and $A_0 = \lambda I_{SA}$. Thus, since $d_{\pi_k} \neq \hat{d}_k$, we cannot apply the usual elliptical potential lemma. Luckily, it is possible to derive a variation of the lemma, suited for our needs, by recognizing that $d_{\pi_k} = \mathbb{E}\big[\hat{d}_k \mid \mathcal{F}_{k-1}\big]$, where the equality holds component-wise. Based on this observation, the next lemma, which is central to the analysis of all algorithms in this work, can be established (in Appendix B we prove a slightly more general statement that will also be useful in the next sections).

Lemma 2 (Expected Elliptical Potential Lemma). Let $\lambda > 0$. Then, uniformly for all $K > 0$, with probability greater than $1 - \delta$, it holds that
$$\sum_{k=1}^{K} \|d_{\pi_k}\|_{A_{k-1}^{-1}} \leq \tilde{O}\Big( \big(1 \vee \tfrac{H}{\sqrt{\lambda}}\big) \sqrt{KSA \log\tfrac{\lambda + KH^2}{\lambda}} \Big).$$

Proof Sketch. Applying Jensen's inequality (using the convexity of norms), we get
$$\|d_{\pi_k}\|_{A_{k-1}^{-1}} = \big\| \mathbb{E}\big[\hat{d}_k \mid \mathcal{F}_{k-1}\big] \big\|_{A_{k-1}^{-1}} \leq \mathbb{E}\big[ \|\hat{d}_k\|_{A_{k-1}^{-1}} \mid \mathcal{F}_{k-1} \big].$$
Therefore, we can write
$$\sum_{k=1}^{K} \|d_{\pi_k}\|_{A_{k-1}^{-1}} \leq \underbrace{\sum_{k=1}^{K} \Big( \mathbb{E}\big[ \|\hat{d}_k\|_{A_{k-1}^{-1}} \mid \mathcal{F}_{k-1} \big] - \|\hat{d}_k\|_{A_{k-1}^{-1}} \Big)}_{(a)} + \underbrace{\sum_{k=1}^{K} \|\hat{d}_k\|_{A_{k-1}^{-1}}}_{(b)},$$
where in the last relation we added and subtracted the random variable $\|\hat{d}_k\|_{A_{k-1}^{-1}}$. It is evident that (a) is a bounded martingale difference sequence and, thus, can be bounded with probability $1 - \delta/2$ by $O\big(\tfrac{H}{\sqrt{\lambda}}\sqrt{K \log\tfrac{1}{\delta}}\big)$ via Azuma's inequality. Term (b) can be bounded by applying the usual elliptical potential lemma (Abbasi-Yadkori, Pál, and Szepesvári 2011) by $O\big(\big(1 \vee \tfrac{H}{\sqrt{\lambda}}\big)\sqrt{2KSA \log\tfrac{\lambda + KH^2}{\lambda}}\big)$. Combining the bounds on (a) and (b) concludes the proof.

Based on the concentration of the estimated reward $\hat{r}_k$ around the true reward $r$ (Proposition 1) and the expected elliptical potential lemma (Lemma 2), the following performance guarantee for Algorithm 1 is established (see Appendix C for the full proof).

Theorem 3 (OFUL for RL with Trajectory Feedback and Known Model). For any $\delta \in (0, 1)$, it holds with probability greater than $1 - \delta$ that for all $K > 0$, $R(K) \leq \tilde{O}\big(SAH\sqrt{K}\big)$.

To exemplify how the expected elliptical potential lemma is applied in the analysis of Algorithm 1, we supply a proof sketch.

Proof Sketch. By the optimism of the update rule, following (Abbasi-Yadkori, Pál, and Szepesvári 2011), it is possible to show that with high probability,
$$V_1^*(s_1^k) = d_{\pi^*}^T r \leq d_{\pi_k}^T \hat{r}_{k-1} + l_{k-1} \|d_{\pi_k}\|_{A_{k-1}^{-1}},$$
for any $k > 0$. Thus, we only need to bound the on-policy prediction error as follows:
$$\sum_{k=1}^{K} \big( V_1^* - V_1^{\pi_k} \big) \leq \sum_{k=1}^{K} \big( d_{\pi_k}^T \hat{r}_{k-1} + l_{k-1} \|d_{\pi_k}\|_{A_{k-1}^{-1}} - d_{\pi_k}^T r \big) \leq \sum_{k=1}^{K} 2\, l_{k-1} \|d_{\pi_k}\|_{A_{k-1}^{-1}}, \qquad (5)$$
where the last inequality can be derived using Proposition 1 and the Cauchy-Schwarz inequality. Applying the expected elliptical potential lemma (Lemma 2) and setting $\lambda = H$ concludes the proof.

Although Algorithm 1 provides a natural solution to the problem, it carries a major computational disadvantage. The optimization problem that needs to be solved in each iteration is a convex maximization problem (known to generally be NP-hard (Atamtürk and Gómez 2017)). Furthermore, since $\|d_\pi\|_{A_{k-1}^{-1}}$ is non-linear in $d_\pi$, it prevents us from solving this problem by means of Dynamic Programming. In the next section, we follow a different route and formulate a Thompson Sampling based algorithm, whose computational complexity amounts to sampling Gaussian noise for the reward and solving an MDP at each episode.
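To see where the hardness sits, note that for a fixed policy and a known kernel both ingredients of the optimistic objective are cheap: $d_\pi$ follows from a forward recursion over the occupancy measure (Equation (1)), and $d_\pi^T \hat{r} + l\,\|d_\pi\|_{A^{-1}}$ is then a single evaluation. The difficulty lies only in the maximization over policies, because the norm term is non-linear in $d_\pi$. The sketch below (our illustration, not code from the paper) evaluates the objective for one deterministic policy.

```python
import numpy as np

def average_occupancy(P, policy, H, s1=0):
    """d_pi(s, a) = sum_h q_h(s, a): forward recursion over a known kernel P[s, a, s']."""
    S, A, _ = P.shape
    q = np.zeros(S)                       # distribution over states at step h
    q[s1] = 1.0
    d = np.zeros((S, A))
    for h in range(H):
        for s in range(S):
            d[s, policy[s, h]] += q[s]    # q_h(s, a) = q_h(s) for a = pi(s, h)
        # push the state distribution forward: q_{h+1}(s') = sum_s q_h(s) P[s, pi(s,h), s']
        q = sum(q[s] * P[s, policy[s, h]] for s in range(S))
    return d.reshape(-1)                  # flatten to R^{SA}

def oful_objective(P, policy, H, r_hat, A_inv, l):
    """d_pi^T r_hat + l * ||d_pi||_{A^{-1}} for a fixed policy (easy to evaluate);
    the hard part is maximizing this over all policies."""
    d = average_occupancy(P, policy, H)
    return d @ r_hat + l * np.sqrt(d @ A_inv @ d)

# Usage sketch on a toy instance
rng = np.random.default_rng(0)
S, A, H = 2, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
r_hat = rng.uniform(size=S * A)
A_inv = np.eye(S * A) / H                 # e.g., A_0^{-1} with lambda = H
policy = rng.integers(A, size=(S, H))
print(oful_objective(P, policy, H, r_hat, A_inv, l=1.0))
```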
5 Thompson Sampling for RL with Trajectory Feedback

The OFUL-based algorithm for RL with trajectory feedback, analyzed in the previous section, was shown to give good performance in terms of regret. However, implementing the algorithm requires solving a convex maximization problem before each episode, which is, in general, computationally hard. Instead of following the OFUL-based approach, in this section, we analyze a Thompson Sampling (TS) approach for RL with trajectory feedback. We start by studying the performance of Algorithm 2, which assumes access to the transition model (as in Section 4). Then, we study Algorithm 3, which generalizes the latter method to the case where the transition model is unknown. In this generalization, we use an optimism-based approach to learn the transition model, and a TS-based approach to learn the reward. The combination of optimism and TS results in a tractable algorithm in which every iteration amounts to solving an empirical MDP (which can be done by Dynamic Programming). The reward estimator in both Algorithm 2 and Algorithm 3 is the same LS estimator (4) used for the OFUL-like algorithm. Finally, we focus on improving the most computationally demanding stage of Algorithm 3, which is the reward sampling, and suggest a more efficient method in Algorithm 4.

5.1 TS for RL with Trajectory Feedback and Known Model

For general action sets, it is known that OFUL (Abbasi-Yadkori, Pál, and Szepesvári 2011) results in a computationally intractable update rule. One popular approach to mitigate the computational burden is to resort to TS for linear bandits (Agrawal and Goyal 2013). Then, the update rule amounts to solving a linear optimization problem over the action set. Yet, the reduced computational complexity of TS comes at the cost of an increase in the regret. Specifically, for linear bandit problems in dimension $m$, OFUL achieves $\tilde{O}(m\sqrt{T})$ regret, whereas TS achieves $\tilde{O}(m^{3/2}\sqrt{T})$ (Agrawal and Goyal 2013; Abeille, Lazaric et al. 2017).

Algorithm 2: TS for RL with Trajectory Feedback and Known Model
Require: $\delta \in (0, 1)$, $\lambda = H$, $v_k = \sqrt{9SAH \log\tfrac{kH^2}{\delta}}$, $l_k = \sqrt{\tfrac{1}{4} SAH \log\tfrac{1 + kH^2/\lambda}{\delta}} + \sqrt{\lambda SA}$
Initialize: $A_0 = \lambda I_{SA}$, $Y_0 = 0_{SA}$
for $k = 1, \ldots, K$ do
  Calculate $\hat{r}_{k-1}$ via LS estimation (4)
  Draw noise $\xi_k \sim \mathcal{N}(0, v_k^2 A_{k-1}^{-1})$ and define $\tilde{r}_k = \hat{r}_{k-1} + \xi_k$
  Solve an MDP with the perturbed empirical reward: $\pi_k \in \arg\max_\pi d_\pi(P)^T \tilde{r}_k$
  Play $\pi_k$, observe $\hat{V}_k$ and $\{(s_h^k, a_h^k)\}_{h=1}^{H}$
  Update $A_k = A_{k-1} + \hat{d}_k \hat{d}_k^T$ and $Y_k = Y_{k-1} + \hat{d}_k \hat{V}_k$
end for

Algorithm 2 can be understood as a TS variant of Algorithm 1, much like TS for linear bandits (Agrawal and Goyal 2013) is a TS variant of OFUL. Unlike the common TS algorithm for linear bandits, Algorithm 2 uses the LS estimator of Section 3, i.e., the one which uses the empirical state-action distributions $\hat{d}_k$, instead of the true action $d_{\pi_k}$. In terms of analysis, we deal with this discrepancy by applying, as in Section 4, the expected elliptical potential lemma (we use a slightly more general version of the lemma, presented in Appendix B), instead of the standard elliptical potential lemma. Then, by extending techniques from (Agrawal and Goyal 2013; Russo 2019), we obtain the following performance guarantee for Algorithm 2 (see Appendix D for the full proof).

Theorem 4 (TS for RL with Trajectory Feedback and Known Model). For any $\delta \in (0, 1)$, it holds with probability greater than $1 - \delta$ that for all $K > 0$,
$$R(K) \leq O\Big( (SA)^{3/2} H \sqrt{K \log(K) \log\tfrac{KH}{\delta}} \Big).$$

Observe that Theorem 4 establishes a regret guarantee of order $m^{3/2}\sqrt{K}$, since the dimension of this specific linear bandit problem is $m = SA$ (see (1)). This is the type of regret expected from a TS-style analysis (Agrawal and Goyal 2013). It is an interesting question whether this bound can be improved due to the structure of the problem.
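Per episode, Algorithm 2 only has to draw Gaussian noise and plan in a known-model MDP with the perturbed reward, which finite-horizon dynamic programming handles directly. A minimal sketch of this per-episode step (ours; the function names are illustrative):

```python
import numpy as np

def solve_finite_horizon_mdp(P, r, H):
    """Backward dynamic programming for a known kernel P[s, a, s'] and reward r[s, a].
    Returns a deterministic policy pi[s, h] maximizing the H-step return."""
    S, A, _ = P.shape
    V = np.zeros(S)
    policy = np.zeros((S, H), dtype=int)
    for h in reversed(range(H)):
        Q = r + P @ V                      # Q[s, a] = r[s, a] + sum_s' P[s, a, s'] V[s']
        policy[:, h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def ts_episode_policy(P, r_hat, A_inv, v, H, rng):
    """One Thompson-sampling step in the spirit of Algorithm 2 (sketch): perturb the LS
    reward estimate with noise xi ~ N(0, v^2 A^{-1}) and plan in the resulting MDP."""
    SA = r_hat.shape[0]
    xi = rng.multivariate_normal(np.zeros(SA), (v ** 2) * A_inv)
    r_tilde = (r_hat + xi).reshape(P.shape[0], P.shape[1])
    return solve_finite_horizon_mdp(P, r_tilde, H)

# Usage sketch on a toy instance
rng = np.random.default_rng(0)
S, A, H = 2, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
r_hat = rng.uniform(size=S * A)
A_inv = np.eye(S * A) / H                  # e.g., A_{k-1}^{-1} at k = 1 with lambda = H
pi_k = ts_episode_policy(P, r_hat, A_inv, v=1.0, H=H, rng=rng)
```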
5.2 UCBVI-TS for RL with Trajectory Feedback

In the previous sections, we devised algorithms for RL with trajectory feedback assuming access to the true transition model, so that only the reward function needs to be learned. In this section, we relax this assumption and study the setting in which the transition model is also unknown. This setting highlights the importance of the LS estimator (4), which uses the empirical state-action frequency $\hat{d}_k$ instead of $d_{\pi_k}$: when the transition model is not known, we do not have access to $d_{\pi_k}$. Nevertheless, it is reasonable to assume access to $\hat{d}_k$, since it only depends on the observed sequence of state-action pairs in the $k$th episode, $\{(s_h^k, a_h^k)\}_{h=1}^{H}$, and does not require any access to the true model. For this reason, the LS estimator (4) is much more amenable to RL with trajectory feedback when the transition model is not given and needs to be estimated.

Algorithm 3, which we refer to as UCBVI-TS (Upper Confidence Bound Value Iteration and Thompson Sampling), uses a combined TS and optimistic approach for RL with trajectory feedback. At each episode, the algorithm perturbs the LS estimate of the reward $\hat{r}_{k-1}$ by random Gaussian noise $\xi_k$, similarly to Algorithm 2. Furthermore, to encourage the agent to learn the unknown transition model, UCBVI-TS adds to the reward estimate the bonus $b_{k-1}^{pv} \in \mathbb{R}^{SA}$, where
$$b_{k-1}^{pv}(s, a) \approx \frac{H}{\sqrt{n_{k-1}(s, a) \vee 1}}, \qquad (6)$$
up to logarithmic factors (similarly to Azar, Osband, and Munos 2017). Then, it simply solves the empirical MDP defined by the plug-in transition model $\bar{P}_{k-1}$ and the reward function $\hat{r}_{k-1} + \xi_k + b_{k-1}^{pv}$. Specifically, the transition kernel is estimated by
$$\bar{P}_k(s' \mid s, a) = \frac{\sum_{l=1}^{k} \sum_{h=1}^{H} \mathbb{1}\big\{ s_h^l = s,\, a_h^l = a,\, s_{h+1}^l = s' \big\}}{n_k(s, a) \vee 1}. \qquad (7)$$

Algorithm 3: UCBVI-TS for RL with Trajectory Feedback
Require: $\delta \in (0, 1)$, $\lambda = H$, $v_k = \sqrt{9SAH \log\tfrac{kH^2}{\delta}}$, $l_k = \sqrt{\tfrac{1}{4} SAH \log\tfrac{1 + kH^2/\lambda}{\delta}} + \sqrt{\lambda SA}$, $b_k^{pv}(s, a) = \sqrt{\tfrac{H^2 \log\tfrac{40SAH^2k^3}{\delta}}{n_k(s, a) \vee 1}}$
Initialize: $A_0 = \lambda I_{SA}$, $Y_0 = 0_{SA}$, counters $n_0(s, a) = 0$ for all $s, a$
for $k = 1, \ldots, K$ do
  Calculate $\hat{r}_{k-1}$ via LS estimation (4) and $\bar{P}_{k-1}$ by (7)
  Draw noise $\xi_k \sim \mathcal{N}(0, v_k^2 A_{k-1}^{-1})$ and define $r_k^b = \hat{r}_{k-1} + \xi_k + b_{k-1}^{pv}$
  Solve the empirical MDP with the optimistic-perturbed reward: $\pi_k \in \arg\max_\pi d_\pi(\bar{P}_{k-1})^T r_k^b$
  Play $\pi_k$, observe $\hat{V}_k$ and $\{(s_h^k, a_h^k)\}_{h=1}^{H}$
  Update the counters $n_k(s, a)$, $A_k = A_{k-1} + \hat{d}_k \hat{d}_k^T$ and $Y_k = Y_{k-1} + \hat{d}_k \hat{V}_k$
end for

The next result establishes a performance guarantee for UCBVI-TS with trajectory feedback (see the proof in Appendix E.3). The key idea of the proof is showing that the additional bonus term (6) induces a sufficient amount of optimism with fixed probability. Then, by generalizing the analysis of Theorem 4 while using some structural properties of MDPs, we derive the final result.

Theorem 5 (UCBVI-TS Performance Guarantee). For any $\delta \in (0, 1)$, it holds with probability greater than $1 - \delta$ that for all $K > 0$,
$$R(K) \leq O\Big( SH(SA + H) \sqrt{AHK \log K \log\tfrac{SAHK}{\delta}} \Big) + O\Big( S(SA + H)^2 \log\tfrac{SAHK}{\delta} \Big);$$
thus, discarding logarithmic factors and constants and assuming $SA \geq H$, $R(K) \leq \tilde{O}\big( S^2 A^{3/2} H^{3/2} \sqrt{K} \big)$.
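Both the plug-in kernel (7) and the bonus (6) are simple functions of visit counts and can be maintained incrementally. The sketch below is our illustration; the constant inside the bonus follows the Require line of Algorithm 3 as reconstructed above and should be treated as an assumption.

```python
import numpy as np

class ModelEstimator:
    """Visit counts, plug-in kernel (7), and UCBVI-style bonus (6) for UCBVI-TS (sketch)."""

    def __init__(self, S, A):
        self.n = np.zeros((S, A))            # n_k(s, a)
        self.n_sas = np.zeros((S, A, S))     # transition counts
        self.S, self.A = S, A

    def update(self, transitions):
        # transitions: list of (s, a, s_next) tuples observed during the episode
        for s, a, s_next in transitions:
            self.n[s, a] += 1
            self.n_sas[s, a, s_next] += 1

    def kernel(self):
        # P_bar_k(s'|s, a) = n_k(s, a, s') / (n_k(s, a) v 1), cf. Eq. (7)
        return self.n_sas / np.maximum(self.n, 1)[:, :, None]

    def bonus(self, H, k, delta):
        # b_k(s, a) = sqrt( H^2 log(40 S A H^2 k^3 / delta) / (n_k(s, a) v 1) ), cf. Eq. (6)
        log_term = np.log(40 * self.S * self.A * H ** 2 * max(k, 1) ** 3 / delta)
        return np.sqrt(H ** 2 * log_term / np.maximum(self.n, 1))

# Usage sketch
est = ModelEstimator(S=2, A=2)
est.update([(0, 1, 1), (1, 0, 0), (0, 1, 0)])
P_bar, b = est.kernel(), est.bonus(H=3, k=1, delta=0.1)
```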
5.3 Improving the Computational Efficiency of UCBVI-TS

In this section, we present a modification of UCBVI-TS that uses a doubling trick to improve the computational efficiency of Algorithm 3. Specifically, for $m = SA$, the complexity of the different parts of UCBVI-TS is as follows. $A_k^{-1}$ can be iteratively updated using $O(m^2)$ computations by the Sherman-Morrison formula (Bartlett 1951). Given $A_k^{-1}$, calculating $\hat{r}_k$ requires $O(m^2)$ computations. Generating the noise $\xi_k$ requires calculating $A_k^{-1/2}$, which, in general, requires performing a singular value decomposition of $A_k^{-1}$ at a cost of $O(m^3)$ computations. Finally, calculating the policy using dynamic programming requires $O(S^2AH)$ computations. Overall, UCBVI-TS requires $O\big((SA)^3 + S^2A(H + A)\big)$ computations per episode, where the most demanding part is the noise generation.

Thus, we suggest a variant of our algorithm, called Rarely-Switching UCBVI-TS (see Appendix F for the full description of the algorithm), that updates $A_k$ (and, as a result, the LS estimator) only after the updates increase the determinant of $A_k$ by a factor of $1 + C$, for some $C > 0$, similarly to (Abbasi-Yadkori, Pál, and Szepesvári 2011). Specifically, we let $B_k = \lambda I_{SA} + \sum_{l=1}^{k} \hat{d}_l \hat{d}_l^T$ and update $A_k = B_k$ if $\det(B_k) > (1 + C)\det(A_k)$, where $B_0 = A_0 = \lambda I_{SA}$. Otherwise, we keep $A_k = A_{k-1}$. By the matrix-determinant lemma, $\det(B_k)$ can be iteratively updated by $\det(B_k) = \big(1 + \hat{d}_k^T B_{k-1}^{-1} \hat{d}_k\big)\det(B_{k-1})$, which requires $O(SA)$ calculations given $B_{k-1}^{-1}$; in turn, $B_{k-1}^{-1}$ can be updated by the Sherman-Morrison formula (a small bookkeeping sketch of this update rule is given at the end of this subsection). Notably, $A_k$ is rarely updated, as we prove in the following lemma:

Lemma 6. Under the update rule of Rarely-Switching UCBVI-TS, and for any $C > 0$, $\lambda > 0$, it holds that
$$\sum_{k=1}^{K} \mathbb{1}\big( A_k \neq A_{k-1} \big) \leq \frac{SA}{\log(1 + C)} \log\Big( 1 + \frac{KH^2}{\lambda SA} \Big).$$
Therefore, the average per-episode computational complexity of Rarely-Switching UCBVI-TS after $K$ episodes is
$$O\Big( S^2A(H + A) + \frac{(SA)^4}{K \log(1 + C)} \log(KH) \Big).$$

Moreover, rarely updating $A_k$ only affects the lower-order terms of the regret, as we prove in the following theorem:

Theorem 7 (Rarely-Switching UCBVI-TS Performance Guarantee). For any $\delta \in (0, 1)$, it holds with probability greater than $1 - \delta$ that for all $K > 0$,
$$R(K) \leq O\Big( SH(SA + H) \sqrt{AHK \log K \log\tfrac{SAHK}{\delta}} \Big) + O\Big( (SA)^{3/2} H \sqrt{(1 + C)\, K \log(K) \log\tfrac{KH}{\delta}} \Big).$$
The proof can be found in Appendix F. Note that the difference from Theorem 5 is in the last term, which is negligible compared to the first term for reasonable values of $C > 0$.
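The lazy update only needs the determinant ratio of $B_k$ against the last accepted $A_k$, which can be tracked with rank-one updates (Sherman-Morrison for $B_k^{-1}$, the matrix-determinant lemma for $\det B_k$) in $O((SA)^2)$ per episode. A small bookkeeping sketch (ours, with illustrative names):

```python
import numpy as np

class RarelySwitchingDesign:
    """Track B_k = lambda*I + sum_l d_l d_l^T and refresh the design matrix A_k only
    when det(B_k) > (1 + C) * det(A_k), in the spirit of Rarely-Switching UCBVI-TS."""

    def __init__(self, dim, lam, C):
        self.B_inv = np.eye(dim) / lam       # B_0^{-1}
        self.log_det_B = dim * np.log(lam)   # log det(B_0)
        self.log_det_A = dim * np.log(lam)   # log det(A_0) = log det(B_0)
        self.A_inv = self.B_inv.copy()       # A_0^{-1}
        self.C = C

    def update(self, d):
        """Rank-one update with the episode's empirical frequency vector d = d_hat_k."""
        Bd = self.B_inv @ d
        ratio = 1.0 + d @ Bd                                 # matrix-determinant lemma
        self.log_det_B += np.log(ratio)
        self.B_inv -= np.outer(Bd, Bd) / ratio               # Sherman-Morrison
        switched = self.log_det_B > np.log(1.0 + self.C) + self.log_det_A
        if switched:                                         # rare, expensive refresh
            self.A_inv = self.B_inv.copy()
            self.log_det_A = self.log_det_B
        return switched

# Usage sketch: only episodes where update() returns True pay the O((SA)^3) noise cost.
design = RarelySwitchingDesign(dim=4, lam=3.0, C=1.0)
rng = np.random.default_rng(0)
switches = sum(design.update(rng.integers(0, 3, size=4).astype(float)) for _ in range(50))
print(switches)
```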
6 Discussion and Conclusions

In this work, we formulated the framework of RL with trajectory feedback and studied different RL algorithms in the presence of such feedback. Indeed, in practical scenarios, such feedback is more reasonable to have, as it requires a weaker type of feedback relative to the standard RL one. For this reason, we believe that studying it and understanding the gaps between trajectory-feedback RL and standard RL is of importance. The central result of this work is a hybrid optimistic-TS RL algorithm with a provably bounded $\sqrt{K}$ regret that can be applied when both the reward and the transition model are unknown and, thus, need to be learned. Importantly, the suggested algorithm is computationally tractable, as it only requires solving an empirical MDP and not a convex maximization problem.

Regret minimization for standard RL has been extensively studied. Previous algorithms for this scenario can be roughly divided into optimistic algorithms (Jaksch, Ortner, and Auer 2010; Azar, Osband, and Munos 2017; Jin et al. 2018; Dann et al. 2019; Zanette and Brunskill 2019; Simchowitz and Jamieson 2019; Efroni et al. 2019) and Thompson-Sampling (or Posterior-Sampling) based algorithms (Osband, Russo, and Van Roy 2013; Gopalan and Mannor 2015; Osband and Van Roy 2017; Russo 2019). Nonetheless, and to the best of our knowledge, we are the first to present a hybrid approach that utilizes both concepts in the same algorithm. Specifically, we combine the optimistic confidence intervals of UCBVI (Azar, Osband, and Munos 2017) alongside linear TS for the reward, and also take advantage of analysis tools for posterior sampling in RL (Russo 2019).

In the presence of trajectory feedback, our algorithms make use of concepts from linear bandits to learn the reward. Specifically, we use both OFUL (Abbasi-Yadkori, Pál, and Szepesvári 2011) and linear TS (Agrawal and Goyal 2013; Abeille, Lazaric et al. 2017), whose regret bounds for $m$-dimensional problems after $K$ time-steps with 1-subgaussian noise are $\tilde{O}\big(m\sqrt{K}\big)$ and $\tilde{O}\big(m^{3/2}\sqrt{K}\big)$, respectively. These bounds directly affect the performance in the RL setting, but the adaptation of OFUL leads to a computationally intractable algorithm. In addition, when there are at most $N$ contexts, it is possible to achieve a regret bound of $\tilde{O}\big(\sqrt{mK \log N}\big)$ (Chu et al. 2011); however, the number of deterministic policies, which is the number of contexts for RL with trajectory feedback, is exponential in $S$, namely, $A^{SH}$. Therefore, such approaches would lead to guarantees similar to OFUL and would also be computationally intractable.

In terms of regret bounds, the minimax regret in the standard RL setting is $\tilde{O}\big(\sqrt{SAHT}\big)$ (Osband and Van Roy 2016; Azar, Osband, and Munos 2017); however, for standard RL the reward feedback is much stronger than for RL with trajectory feedback. For linear bandits with $H$-subgaussian noise, the minimax performance bound is $\tilde{O}\big(m\sqrt{HK}\big)$ (Dani, Hayes, and Kakade 2008). Specifically, in RL we set $m = SA$, which leads to $\tilde{O}\big(SA\sqrt{HK}\big)$. Nonetheless, for RL with trajectory feedback and a known model, the context space is the set of average occupancy measures $d_\pi$, which is heavily structured. It is an open question whether the minimax regret bound remains $\tilde{O}\big(SA\sqrt{HK}\big)$ for RL with trajectory feedback when the transition model is known, or whether it can be improved. Moreover, when the model is unknown, our algorithm enjoys a regret of $\tilde{O}\big(S^2A^{3/2}H^{3/2}\sqrt{K}\big)$ when $H \leq SA$. A factor of $SA$ is a direct result of the TS approach, which was required to make the algorithm tractable, and an additional $S$ appears when the model is unknown. Moreover, extending OFUL to the case of an unknown model and following an analysis similar to Theorem 5 would still yield this extra $S$ factor (and would result in a computationally hard algorithm), in comparison to when we know the model. It is an open question whether this additional factor can also be improved.

Finally, we believe that this work paves the way to many interesting future research directions, notably, studying RL with additional, more realistic, feedback models of the reward. Furthermore, we believe that the results can be adapted to cases where the feedback is a more complex mapping from state-actions into a trajectory reward, and, specifically, a noisy generalized linear model (GLM) of the trajectory (Filippi et al. 2010; Abeille, Lazaric et al. 2017; Kveton et al. 2020). In this case, even though the reward function is not Markovian, our approach should allow deriving regret bounds. More generally, this can be viewed as a form of reward shaping with theoretical guarantees, which is, in general, an open question.

Acknowledgments

This work was partially funded by the Israel Science Foundation under ISF grant number 2199/20. YE is partially supported by the Viterbi scholarship, Technion. NM is partially supported by the Gutwirth Scholarship.

References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312-2320.
Abeille, M.; Lazaric, A.; et al. 2017. Linear Thompson sampling revisited. Electronic Journal of Statistics 11(2): 5165-5197.
Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2020. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, 64-66. PMLR.
Agrawal, S.; and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 127-135.
Altman, E. 1999. Constrained Markov decision processes, volume 7. CRC Press.
Atamtürk, A.; and Gómez, A. 2017. Maximizing a class of utility functions over the vertices of a polytope. Operations Research 65(2): 433-445.
Azar, M. G.; Osband, I.; and Munos, R. 2017. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449.
Bartlett, M. S. 1951. An inverse matrix adjustment arising in discriminant analysis. The Annals of Mathematical Statistics 22(1).
Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 208-214.
Dani, V.; Hayes, T.; and Kakade, S. 2008. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory (COLT). Citeseer.
Dann, C.; Li, L.; Wei, W.; and Brunskill, E. 2019. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, 1507-1516.
Efroni, Y.; Merlis, N.; Ghavamzadeh, M.; and Mannor, S. 2019. Tight regret bounds for model-based reinforcement learning with greedy policies. In Advances in Neural Information Processing Systems, 12224-12234.
Filippi, S.; Cappe, O.; Garivier, A.; and Szepesvári, C. 2010. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, 586-594.
Gopalan, A.; and Mannor, S. 2015. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, 861-898.
Jaksch, T.; Ortner, R.; and Auer, P. 2010. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11(Apr): 1563-1600.
Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, 4863-4873.
Kveton, B.; Zaheer, M.; Szepesvári, C.; Li, L.; Ghavamzadeh, M.; and Boutilier, C. 2020. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, 2066-2076.
Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17(1): 1334-1373.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529.
Osband, I.; Russo, D.; and Van Roy, B. 2013. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, 3003-3011.
Osband, I.; and Van Roy, B. 2016. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732.
Osband, I.; and Van Roy, B. 2017. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, 2701-2710.
Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
Russo, D. 2019. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems, 14433-14443.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676): 354.
Simchowitz, M.; and Jamieson, K. G. 2019. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, 1153-1162.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4): 229-256.
Zanette, A.; and Brunskill, E. 2019. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210.
Zimin, A.; and Neu, G. 2013. Online learning in episodic Markovian decision processes by relative entropy policy search. In NIPS, 1583-1591.