# Consistent On-Line Off-Policy Evaluation

Assaf Hallak¹, Shie Mannor¹

## Abstract

The problem of on-line off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a module in a policy improvement scheme. However, most Temporal Difference (TD) based solutions ignore the discrepancy between the stationary distributions of the behavior and target policies, and its effect on the convergence limit when function approximation is applied. In this paper we propose the Consistent Off-Policy Temporal Difference (COP-TD(λ, β)) algorithm, which addresses this issue and reduces the resulting bias at some computational expense. We show that COP-TD(λ, β) can be designed to converge to the same value that would have been obtained by using on-policy TD(λ) with the target policy. The proposed scheme also leads to a related and promising heuristic we call Log-COP-TD(λ, β). Both algorithms compare favorably with current state-of-the-art on-line OPE algorithms in our experiments. Finally, our formulation sheds some new light on the recently proposed Emphatic TD learning.

¹The Technion, Haifa, Israel. Correspondence to: Assaf Hallak, Shie Mannor.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

## 1. Introduction

Reinforcement Learning (RL) techniques have been successfully applied in fields such as robotics, games, marketing and more (Kober et al., 2013; Al-Rawi et al., 2015; Barrett et al., 2013). We consider the problem of off-policy evaluation (OPE): assessing the performance of a complex strategy without applying it. An OPE formulation is often considered in domains with limited sampling capability. For example, in marketing and recommender systems (Theocharous and Hallak, 2013; Theocharous et al., 2015) policies relate directly to revenue. A more extreme example is drug administration, as there are only few patients in the testing population, and sub-optimal policies can have life-threatening effects (Hochberg et al., 2016). OPE is also useful as a module for policy optimization in a policy improvement scheme (Thomas et al., 2015a).

In this paper, we consider the OPE problem in an on-line setup, where each new sample is immediately used to update our current value estimate of some previously unseen policy. We propose and analyze a new algorithm called COP-TD(λ, β) for estimating the value of the target policy; COP-TD(λ, β) has the following properties:

1. It is easy to understand and to implement on-line.
2. It allows closing the gap to consistency, so that the limit point is the same one that would have been obtained by on-policy learning with the target policy.
3. It is empirically comparable to state-of-the-art algorithms.

Our algorithm resembles the Emphatic TD algorithm of Sutton et al. (2015), which was extended by Hallak et al. (2015) to the general parametric form ETD(λ, β). We clarify the connection between the algorithms and compare them empirically. Finally, we introduce an additional related heuristic called Log-COP-TD(λ, β) and motivate it.

## 2. Notations and Background

We consider the standard discounted Markov Decision Process (MDP) formulation (Bertsekas and Tsitsiklis, 1996) with a single long trajectory. Let M = (S, A, P, R, ζ, γ) be an MDP, where S is the finite state space and A is the finite action space.
The parameter P sets the transition probabilities $\Pr(s'|s, a)$ given the previous state $s \in S$ and action $a \in A$, where the first state is determined by the distribution ζ. The parameter R sets the reward distribution $r(s, a)$ obtained by taking action a in state s, and γ is the discount factor specifying the exponential reduction in reward with time.

The process advances as follows. A state $s_0$ is sampled according to the distribution ζ(s). Then, at each time step t starting from t = 0, the agent draws an action $a_t$ according to the stochastic behavior policy $\mu(a|s_t)$, a reward $r_t \doteq r(s_t, a_t)$ is accumulated by the agent, and the next state $s_{t+1}$ is sampled using the transition probability $\Pr(s'|s_t, a_t)$.

The expected discounted accumulated reward starting from a specific state and choosing actions by some policy π is called the value function, which is also known to satisfy the Bellman equation in vector form:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\right], \qquad T_\pi V \doteq R_\pi + \gamma P_\pi V,$$

where $[R_\pi]_s \doteq \mathbb{E}_\pi[r(s, \pi(s))]$ and $[P_\pi]_{s,s'} \doteq \mathbb{E}_\pi[\Pr(s'|s, \pi(s))]$ are the policy-induced reward vector and transition probability matrix, respectively; $T_\pi$ is called the Bellman operator. The problem of estimating $V^\pi(s)$ from samples is called policy evaluation. If the target policy π differs from the behavior policy µ that generated the samples, the problem is called off-policy evaluation (OPE).

The TD(λ) algorithm (Sutton, 1988) is a standard solution to on-line on-policy evaluation: at each time step the temporal difference error updates the current value function estimate, so that eventually the stochastic approximation process converges to the true value function. The standard form of TD(λ) is given by:

$$R^{(n)}_{t,s_t} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \hat{V}_t(s_{t+n}), \qquad
R^{\lambda}_{t,s_t} = (1-\lambda) \sum_{n=0}^{\infty} \lambda^n R^{(n+1)}_{t,s_t}, \qquad
\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t\left(R^{\lambda}_{t,s_t} - \hat{V}_t(s_t)\right), \tag{1}$$

where $\alpha_t$ is the step size. The value $R^{(n)}_{t,s_t}$ is an estimate of the current state's value $V(s_t)$ looking forward n steps, and $R^{\lambda}_{t,s_t}$ is an exponentially weighted average of all of these estimates going forward to infinity. Notice that Equation (1) does not specify an on-line implementation, since $R^{(n)}_{t,s_t}$ depends on future observations; however, there exists a compact on-line implementation using eligibility traces (Bertsekas and Tsitsiklis (1996) for on-line TD(λ), and Sutton et al. (2014), Sutton et al. (2015) for off-policy TD(λ)). The underlying operator of TD(λ) is given by:

$$T^{\lambda}_\pi V = (1-\lambda) \sum_{n=0}^{\infty} \lambda^n \left[\sum_{i=0}^{n} \gamma^i P^i_\pi R_\pi + \gamma^{n+1} P^{n+1}_\pi V\right] = (1-\lambda)(I - \lambda T_\pi)^{-1} T_\pi V,$$

and is a $\frac{\gamma(1-\lambda)}{1-\lambda\gamma}$-contraction (Bertsekas, 2012).

We denote by $d_\mu(s)$ the stationary distribution over states induced by following the policy µ, and set $D_\mu = \mathrm{diag}(d_\mu)$. Since we are concerned with the behavior at infinite horizon, we assume $\zeta(s) = d_\mu(s)$. In addition, we assume that the MDP is ergodic for the two specified policies µ and π, so that for every $s \in S$: $d_\mu(s) > 0$ and $d_\pi(s) > 0$, and that the OPE problem is proper: $\pi(a|s) > 0 \Rightarrow \mu(a|s) > 0$.

When the state space is too large to hold $V^\pi(s)$ explicitly, a linear function approximation scheme is used: $V^\pi(s) \approx \theta_\pi^\top \phi(s)$, where θ is the optimized weight vector and φ(s) is the feature vector of state s, composed of k features. We denote by $\Pi_{d_\pi}$ the projection onto the subspace spanned by the features with respect to the $d_\pi$-weighted norm, and by $\Phi \in \mathbb{R}^{|S| \times k}$ the matrix whose rows are the feature vectors of the states; we assume its columns are linearly independent.
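As a concrete reference for the trace-based implementation mentioned above, here is a minimal sketch of on-line TD(λ) with linear function approximation and an accumulating eligibility trace. The sampling routine `sample_step` and the feature map `phi` are hypothetical placeholders, not part of the paper.

```python
import numpy as np

# Minimal sketch of on-line TD(lambda) with linear function approximation and
# accumulating eligibility traces. `sample_step(s)` is a hypothetical stand-in
# returning one on-policy transition (r_t, s_{t+1}) from state s_t, and `phi`
# maps a state to its k-dimensional feature vector.
def td_lambda(sample_step, phi, k, gamma, lam, alpha, n_steps, s0):
    theta = np.zeros(k)          # weight vector, V_hat(s) = theta @ phi(s)
    e = np.zeros(k)              # eligibility trace over features
    s = s0
    for _ in range(n_steps):
        r, s_next = sample_step(s)
        delta = r + theta @ (gamma * phi(s_next) - phi(s))  # TD error
        e = gamma * lam * e + phi(s)                        # trace update
        theta = theta + alpha * delta * e                   # weight update
        s = s_next
    return theta
```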
TD(λ) can be adjusted to find the fixed point of $\Pi_{d_\pi} T^\lambda_\pi$ (Sutton and Barto, 1998):

$$R^{(n)}_{t,s_t} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \theta_t^\top \phi(s_{t+n}), \qquad
R^{\lambda}_{t,s_t} = (1-\lambda) \sum_{n=0}^{\infty} \lambda^n R^{(n+1)}_{t,s_t}, \qquad
\theta_{t+1} = \theta_t + \alpha_t\left(R^{\lambda}_{t,s_t} - \theta_t^\top \phi(s_t)\right)\phi(s_t).$$

Finally, we define the OPE-related quantities:

$$\rho_t \doteq \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}, \qquad \Gamma^n_t \doteq \prod_{i=0}^{n-1} \rho_{t-1-i}, \qquad \rho_d(s) \doteq \frac{d_\pi(s)}{d_\mu(s)};$$

we call $\rho_d$ the covariate shift ratio (as denoted under different settings by Hachiya et al. (2012)).

We summarize the assumptions used in the proofs:

1. For both policies the induced Markov chain is ergodic.
2. The first state $s_0$ is distributed according to the stationary distribution of the behavior policy, $d_\mu(s)$.
3. The problem is proper: $\pi(a|s) > 0 \Rightarrow \mu(a|s) > 0$.
4. The feature matrix Φ has full rank k.

Assumption 1 is commonly used for convergence theorems, as it ensures the value function is well defined on all states regardless of the initial sampled state. Assumption 2 can be relaxed, since we are concerned with the long-term properties of the algorithm past its mixing time; we require it for clarity of the proofs. Assumption 3 is required so that the importance sampling ratios are well defined. Assumption 4 guarantees the optimal θ is unique, which greatly simplifies the proofs.

## 3. Previous Work

We can roughly categorize previous OPE algorithms into two main families. Gradient based methods perform stochastic gradient descent on the error terms they want to minimize. These include GTD (Sutton et al., 2009a), GTD-2, TDC (Sutton et al., 2009b) and HTD (White and White, 2016). The main disadvantages of gradient based methods are (A) they usually update an additional error-correcting term, which means another step-size parameter needs to be controlled; and (B) they rely on estimating non-trivial terms, and these estimates tend to converge slowly.

The other family uses importance sampling (IS) methods that correct the gap between on-policy and off-policy updates using the IS ratios $\rho_t$. Among these are full IS (Precup et al., 2001) and ETD(λ, β) (Sutton et al., 2015). These methods are characterized by the bias-variance trade-off they navigate: between biased (or even divergent) limit points, and very slow convergence stemming from the high variance of the IS correcting factors (the products of the $\rho_t$'s). A few algorithms fall between the two families, for example TO-GTD (van Hasselt et al., 2014) and WIS-TD(λ) (Mahmood and Sutton, 2015). A comparison of these algorithms in terms of convergence rate, synergy with function approximation and more is available in (White and White, 2016; Geist and Scherrer, 2014).

In this paper we focus on the limit point of the convergence. For most of the aforementioned algorithms, the process was shown to converge almost surely to the fixed point of the projected Bellman operator $\Pi_d T_\pi$, where d is some stationary distribution (usually $d_\mu$); however, the d in question was never¹ $d_\pi$, as we would have obtained from running on-policy TD with the target policy (see also Kolter (2011) for a relevant discussion). The algorithm achieving the closest result is ETD(λ, β), which replaces d with $f = (I - \beta P_\pi^\top)^{-1} d_\mu$, where β trades off some of the process variance with the bias in the limit point. Hence, our main contribution is a consistent algorithm that can converge to the same value that would have been obtained by running an on-policy scheme with the target policy.

¹ Except full IS; however, its variance is too high for it to be applicable in practice.
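For later reference (in particular for the numbers in the motivating example of the next section), the fixed point of the projected operator $\Pi_d T_\pi$ with linear features can be computed in closed form for a small, fully known MDP by solving $\Phi^\top D (I - \gamma P_\pi)\Phi\,\theta = \Phi^\top D R_\pi$. The sketch below assumes the policy-induced quantities are available explicitly and is only meant to make the dependence of the limit point on the weighting d visible.

```python
import numpy as np

# Sketch: fixed point of the projected Bellman operator Pi_d T_pi for a known,
# small MDP with linear features. P_pi, R_pi and the weighting d are assumed
# to be given explicitly; this is an illustration, not part of any algorithm.
def projected_fixed_point(Phi, P_pi, R_pi, d, gamma):
    D = np.diag(d)
    A = Phi.T @ D @ (np.eye(len(d)) - gamma * P_pi) @ Phi
    b = Phi.T @ D @ R_pi
    # theta such that Phi @ theta = Pi_d T_pi (Phi @ theta)
    return np.linalg.solve(A, b)
```

Plugging in d = d_µ versus d = d_π yields the two different limit points contrasted in the motivating example below.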
## 4. Motivation

Here we provide a motivating example showing that even in simple cases with close behavior and target policies, the two induced stationary distributions can differ greatly. Choosing a specific linear parameterization further emphasizes the difference between applying on-policy TD with the target policy and applying inconsistent off-policy TD.

Assume a chain MDP with numbered states 1, 2, ..., |S|, where from each state s you can either move left to state s - 1 or right to state s + 1. If you have reached the beginning or the end of the chain (states 1 or |S|), then taking a step further does not affect your location. Assume the behavior policy moves left with probability 0.5 + ϵ, while the target policy moves right with probability 0.5 + ϵ. It is easy to see that the stationary distributions are given by:

$$d_\mu(s) \propto \left(\frac{0.5-\epsilon}{0.5+\epsilon}\right)^{s}, \qquad d_\pi(s) \propto \left(\frac{0.5+\epsilon}{0.5-\epsilon}\right)^{s}.$$

For instance, if we have a length-100 chain with ϵ = 0.01, for the rightmost state we have $d_\mu(|S|) \approx 8 \cdot 10^{-4}$ and $d_\pi(|S|) \approx 0.04$. Let us set the reward to be 1 for the right half of the chain, so the target policy is better, since it spends more time in the right half. The value of the target policy at the edges of the chain for γ = 0.99 is $V^\pi(1) = 0.21$, $V^\pi(100) = 99.97$.

Now, what happens if we try to approximate the value function using one constant feature φ(s) ≡ 1? The fixed point of $\Pi_{d_\mu}T_\pi$ is θ = 11.92, while the fixed point of $\Pi_{d_\pi}T_\pi$ is θ = 88.08, a substantial difference. The reason for this difference lies in the emphasis each projection puts on the states: according to $\Pi_{d_\mu}$, the important states are in the left half of the chain, those with low value, and therefore the value estimate of all states is low. However, according to $\Pi_{d_\pi}$ the important states are concentrated in the right part of the chain, since the target policy visits these more often. Hence, the estimation error is emphasized on the right part of the chain and the value estimate is higher. When we wish to estimate the value of the target policy, we want to know what would happen if we deployed it instead of the behavior policy; thus the fixed point of $\Pi_{d_\pi}T_\pi$ better represents the off-policy evaluation solution.

## 5. COP-TD(λ, β)

Most off-policy algorithms multiply the TD summand of TD(λ) by some value that depends on the history and the current state. For example, full IS-TD by Precup et al. (2001) uses the ratio between the probabilities of the trajectory under the two policies:

$$\frac{P_\pi(s_0, a_0, s_1, \ldots, s_t, a_t)}{P_\mu(s_0, a_0, s_1, \ldots, s_t, a_t)} = \prod_{m=0}^{t} \rho_m = \Gamma^t_t \rho_t. \tag{2}$$

In problems with a long horizon, or those that start from the stationary distribution, we suggest using the time-invariant covariate shift ratio $\rho_d$ multiplied by the current $\rho_t$. The intuition is the following: we would prefer to use the probability ratio given in Equation (2), but it has very high variance, and after many time steps we might as well look at the stationary distribution ratio instead. This direction leads us to the following update equation:

$$\theta_{t+1} = \theta_t + \alpha_t \rho_d(s_t)\rho_t \left[ r_t + \theta_t^\top\big(\gamma\phi(s_{t+1}) - \phi(s_t)\big) \right]\phi(s_t). \tag{3}$$

Lemma 1. If the step sizes satisfy $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, then the process described by Eq. (3) converges almost surely to the fixed point of $\Pi_{d_\pi} T_\pi V = V$.

The proof follows the ODE method (Kushner and Yin, 2003), similarly to Tsitsiklis and Van Roy (1997); see the appendix for more details. Since $\rho_d(s)$ is generally unknown, it is estimated using an additional stochastic approximation process.
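Before turning to the estimation of $\rho_d$, here is a minimal sketch of the update in Equation (3) for the idealized case in which $\rho_d$ is known exactly (the setting of Lemma 1). The sampling routine and the policy evaluators below are hypothetical placeholders.

```python
import numpy as np

# Sketch of the COP-TD update of Equation (3), assuming the covariate shift
# ratio rho_d is known exactly (the oracle setting of Lemma 1). `sample_step(s)`
# is a hypothetical stand-in returning (a_t, r_t, s_{t+1}) drawn under the
# behavior policy mu from state s_t; `pi(a, s)` and `mu(a, s)` return
# action probabilities.
def cop_td_oracle(sample_step, phi, rho_d, pi, mu, k, gamma, alpha, n_steps, s0):
    theta = np.zeros(k)
    s = s0
    for _ in range(n_steps):
        a, r, s_next = sample_step(s)
        rho_t = pi(a, s) / mu(a, s)                          # one-step IS ratio
        delta = r + theta @ (gamma * phi(s_next) - phi(s))   # TD(0) error
        theta = theta + alpha * rho_d(s) * rho_t * delta * phi(s)
        s = s_next
    return theta
```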
To construct the estimate of $\rho_d$, we rely on the following lemma:

Lemma 2. Let $\hat\rho_d$ be an unbiased estimate of $\rho_d$, and for every n = 0, 1, ..., t define $\tilde\Gamma^n_t \doteq \hat\rho_d(s_{t-n})\,\Gamma^n_t$. Then:

$$\mathbb{E}_\mu\left[\tilde\Gamma^n_t \mid s_t\right] = \rho_d(s_t).$$

For any state $s_t$ there are such quantities $\{\tilde\Gamma^n_t\}_{n=0}^{t}$, which we propose to weight similarly to TD(λ):

$$\tilde\Gamma^\beta_t = (1-\beta)\sum_{n=0}^{\infty}\beta^n \tilde\Gamma^{n+1}_t. \tag{4}$$

Note that $\rho_d(s)$, unlike V(s), is restricted to a closed set, since its $d_\mu$-weighted combination equals 1 ($\sum_s d_\mu(s)\rho_d(s) = 1$) and all of its entries are non-negative. We denote this $d_\mu$-weighted simplex by $\Delta_{d_\mu}$, and let $\Pi_{\Delta_{d_\mu}}$ be the (non-linear) projection onto this set with respect to the Euclidean norm ($\Pi_{\Delta_{d_\mu}}$ can be calculated efficiently; Chen and Ye, 2011). Now we can devise a TD algorithm that estimates $\rho_d$ and uses it to find θ, which we call COP-TD(0, β) (Consistent Off-Policy TD).

Algorithm 1: COP-TD(0, β). Input: $\theta_0$, $\hat\rho_{d,0}$.
1: Initialize: $F_0 = 0$, $n^\beta_0 = 1$, $N(s) = 0$ for all s
2: for t = 1, 2, ... do
3: Observe $s_t, a_t, r_t, s_{t+1}$
4: Update normalization terms:
5: $N(s_t) = N(s_t) + 1$; for all $s \in S$: $\hat d_\mu(s) = N(s)/t$
6: $n^\beta_t = \beta n^\beta_{t-1} + 1$
7: Update the weighted average of the $\tilde\Gamma^n_t$:
8: $F_t = \rho_{t-1}(\beta F_{t-1} + e_{s_{t-1}})$
9: Update and project by the TD error of $\rho_d$:
10: $\delta^d_t = F_t^\top \hat\rho_{d,t} / n^\beta_t - \hat\rho_{d,t}(s_t)$
11: $\hat\rho_{d,t+1} = \Pi_{\Delta_{\hat d_\mu}}\big(\hat\rho_{d,t} + \alpha^d_t \delta^d_t e_{s_t}\big)$
12: Off-policy TD(0):
13: $\delta_t = r_t + \theta_t^\top(\gamma\phi(s_{t+1}) - \phi(s_t))$
14: $\theta_{t+1} = \theta_t + \alpha_t \hat\rho_{d,t+1}(s_t)\rho_t \delta_t \phi(s_t)$
15: end for

Similarly to the Bellman operator for TD learning, we define the underlying COP operator Y and its β extension:

$$Y u = D_\mu^{-1} P_\pi^\top D_\mu u, \qquad Y^\beta u = (1-\beta) D_\mu^{-1} P_\pi^\top (I - \beta P_\pi^\top)^{-1} D_\mu u.$$

The following lemma gives some intuition on the convergence of the $\rho_d$ estimation process:

Lemma 3. Under the ergodicity assumption, denote the eigenvalues of $P_\pi$ by $\xi_1 = 1 > |\xi_2| \ge \cdots \ge 0$. Then $Y^\beta$ is a $\max_{i \neq 1} \frac{(1-\beta)|\xi_i|}{|1-\beta\xi_i|} < 1$ contraction in the $L_2$-norm on the subspace orthogonal to $\rho_d$, and $\rho_d$ is a fixed point of $Y^\beta$.

The technical proof is given in the appendix.

Theorem 1. If the step sizes satisfy $\sum_t \alpha^d_t = \infty$, $\sum_t\big(\alpha_t^2 + (\alpha^d_t)^2\big) < \infty$, $\alpha_t/\alpha^d_t \to 0$, $t\alpha^d_t \to \infty$, and $\mathbb{E}\big[(\beta^n\Gamma^n_t)^2 \mid s_t\big] \le C$ for some constant C and every t and n, then after applying COP-TD(0, β), $\hat\rho_{d,t}$ converges to $\rho_d$ almost surely, and $\theta_t$ converges to the fixed point of $\Pi_{d_\pi} T_\pi V = V$.

Notice that COP-TD(0, β), as given in Alg. 1, is infeasible in problems with large state spaces, since $\hat\rho_d \in \mathbb{R}^{|S|}$. As with TD(λ), we can introduce linear function approximation: represent $\rho_d(s) \approx \theta_\rho^\top \phi_\rho(s)$, where $\theta_\rho$ is a weight vector and $\phi_\rho(s)$ is the off-policy feature vector, and adjust the algorithm accordingly. For $\hat\rho_d$ to remain in the set $\Delta_{d_\mu}$, we pose requirements on the feature vectors: $\phi_\rho(s) \in \mathbb{R}^k_+$ and $\sum_s d_\mu(s)\theta_\rho^\top\phi_\rho(s) = 1$, enforced by the simplex projection $\Pi_{\Delta_{\mathbb{E}_\mu[\phi_\rho(s)]}}$. In practice, the latter requirement can be approximated: $\sum_s d_\mu(s)\theta_\rho^\top\phi_\rho(s) \approx \frac{1}{t}\theta_\rho^\top\sum_{t'}\phi_\rho(s_{t'}) = 1$, resulting in an extension of the previously applied $d_\mu$ estimation (step 5 of COP-TD(0, β)). We provide the full details in Algorithm 2, which also incorporates non-zero λ similarly to ETD(λ, β).

Algorithm 2: COP-TD(λ, β) with function approximation. Input: $\theta_0$, $\theta_{\rho,0}$.
1: Initialize: $F_0 = 0$, $n^\beta_0 = 1$, $N_\phi = 0$, $e_0 = 0$
2: for t = 1, 2, ... do
3: Observe $s_t, a_t, r_t, s_{t+1}$
4: Update normalization terms:
5: $n^\beta_t = \beta n^\beta_{t-1} + 1$, $N_\phi = N_\phi + \phi_\rho(s_t)$, $\hat d_{\phi_\rho} = N_\phi / t$
6: Update the weighted average of the $\tilde\Gamma^n_t$:
7: $F_t = \rho_{t-1}(\beta F_{t-1} + \phi_\rho(s_{t-1}))$
8: Update and project by the TD error of $\rho_d$:
9: $\delta^d_t = \theta_{\rho,t}^\top\big(F_t / n^\beta_t - \phi_\rho(s_t)\big)$
10: $\theta_{\rho,t+1} = \Pi_{\Delta_{\hat d_{\phi_\rho}}}\big(\theta_{\rho,t} + \alpha^d_t \delta^d_t \phi_\rho(s_t)\big)$
11: Off-policy TD(λ):
12: $M_t = \lambda + (1-\lambda)\theta_{\rho,t+1}^\top\phi_\rho(s_t)$
13: $e_t = \rho_t(\lambda\gamma e_{t-1} + M_t\phi(s_t))$
14: $\delta_t = r_t + \theta_t^\top(\gamma\phi(s_{t+1}) - \phi(s_t))$
15: $\theta_{t+1} = \theta_t + \alpha_t \delta_t e_t$
16: end for
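The following is a compact tabular sketch of Algorithm 1. It replaces the exact Euclidean projection onto the $d_\mu$-weighted simplex (Chen and Ye, 2011) with a simpler clip-and-rescale step that only enforces non-negativity and the $\hat d_\mu$-weighted normalization; everything else mirrors the listing above, and the sampling routine and policy evaluators are placeholders.

```python
import numpy as np

# Tabular sketch of COP-TD(0, beta) (Algorithm 1). The exact simplex projection
# is simplified to clip-and-rescale (enforcing rho_hat >= 0 and
# sum_s d_mu_hat(s) * rho_hat(s) = 1 only). `sample_step(s)` is a hypothetical
# stand-in returning (a, r, s') drawn under the behavior policy mu.
def cop_td0_beta(sample_step, phi, pi, mu, n_states, k, gamma, beta,
                 alpha, alpha_d, n_steps, s0):
    theta = np.zeros(k)
    rho_hat = np.ones(n_states)          # estimate of rho_d, initialized at 1
    F = np.zeros(n_states)               # state-indexed trace (line 8)
    n_beta, counts = 1.0, np.zeros(n_states)
    s_prev, rho_prev = s0, 1.0           # startup placeholders for s_{t-1}, rho_{t-1}
    s = s0
    for t in range(1, n_steps + 1):
        a, r, s_next = sample_step(s)
        rho_t = pi(a, s) / mu(a, s)
        counts[s] += 1.0
        d_mu_hat = counts / t                                   # line 5
        n_beta = beta * n_beta + 1.0                            # line 6
        F = rho_prev * (beta * F + np.eye(n_states)[s_prev])    # line 8
        delta_d = F @ rho_hat / n_beta - rho_hat[s]             # line 10
        rho_hat[s] += alpha_d * delta_d                         # line 11 (pre-projection)
        rho_hat = np.maximum(rho_hat, 0.0)
        z = d_mu_hat @ rho_hat
        if z > 0:
            rho_hat = rho_hat / z        # enforce d_mu-weighted sum equal to 1
        delta = r + theta @ (gamma * phi(s_next) - phi(s))      # line 13
        theta = theta + alpha * rho_hat[s] * rho_t * delta * phi(s)  # line 14
        s_prev, rho_prev = s, rho_t
        s = s_next
    return theta, rho_hat
```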
Theorem 2. If the step sizes satisfy $\sum_t \alpha^d_t = \infty$, $\sum_t\big(\alpha_t^2 + (\alpha^d_t)^2\big) < \infty$, $\alpha_t/\alpha^d_t \to 0$, $t\alpha^d_t \to \infty$, and $\mathbb{E}\big[(\beta^n\Gamma^n_t)^2 \mid s_t\big] \le C$ for some constant C and every t and n, then after applying COP-TD(0, β) with function approximation satisfying $\phi_\rho(s) \in \mathbb{R}^k_+$, $\hat\rho_{d,t}$ converges almost surely to the fixed point of $\Pi_{\Delta_{\mathbb{E}_\mu[\phi_\rho]}}\Pi_{\phi_\rho} Y^\beta$, denoted by $\rho^{COP}_d$, and if $\theta_t$ converges, it converges to the fixed point of $\Pi_{d_\mu \circ \rho^{COP}_d} T_\pi V$, where $\circ$ is the coordinate-wise product of vectors.

The proof is given in the appendix and also follows the ODE method. Notice that the theorem is only given for λ = 0; convergence results for general λ should follow the work by Yu (2015).

A possible criticism of COP-TD(0, β) is that it is not actually consistent, since in order to be consistent the original state space has to be small, in which case every off-policy algorithm is consistent as well. Still, the dependence on another set of features allows trading off accuracy against computational power in estimating $\rho_d$ and subsequently V. Moreover, smart feature selection may further reduce this gap, and COP-TD(0, β) is still the first algorithm addressing this issue. We conclude by linking the error in the estimate of $\rho_d$ with the difference in the resulting θ, which suggests that a well-estimated $\rho_d$ results in consistency:

Corollary 1. Let 0 < ϵ < 1. If $(1-\epsilon)\rho_d \le \rho^{COP}_d \le (1+\epsilon)\rho_d$, then the fixed point $\theta^{COP}$ of COP-TD(0, β) with function approximation satisfies the following, where $\|\cdot\|$ is the $L_\infty$ induced norm:

$$\left\|\theta^{COP} - \theta^*\right\| \le \epsilon\,\left\|A_\pi^{-1}\Phi^\top\right\|\left(R_{\max} + (1+\gamma)\left\|\Phi\,\theta^{COP}\right\|\right),$$

where $A_\pi = \Phi^\top D_\pi(I - \gamma P_\pi)\Phi$ and $\theta^*$ is the fixed point of the operator $\Pi_{d_\pi} T_\pi V$.

### 5.1. Relation to ETD(λ, β)

Recently, Sutton et al. (2015) suggested an algorithm for off-policy evaluation called Emphatic TD. Their algorithm was later extended by Hallak et al. (2015) and renamed ETD(λ, β), and was shown to perform extremely well empirically by White and White (2016). ETD(0, β) can be written as:

$$F_t = \sum_{n=0}^{t}\beta^n\Gamma^n_t, \qquad \theta_{t+1} = \theta_t + \alpha_t F_t \rho_t\left[r_t + \theta_t^\top\big(\gamma\phi(s_{t+1}) - \phi(s_t)\big)\right]\phi(s_t).$$

As mentioned before, ETD(λ, β) converges to the fixed point of $\Pi_f T^\lambda_\pi$ (Yu, 2015), where $f(s) = d_\mu(s)\,\mathbb{E}[F_t \mid s_t = s]$, i.e., $f = (I - \beta P_\pi^\top)^{-1} d_\mu$. Error bounds can be obtained by showing that the operator $\Pi_f T^\lambda_\pi$ is a contraction under certain requirements on β, and the variance of $F_t$ is directly related to β as well (Hallak et al., 2015), thus affecting the convergence rate of the process.

Comparing the form of ETD(λ, β) with that of COP-TD(λ, β): instead of spending memory and time resources on a state/feature-dependent $F_t$, ETD(λ, β) uses a one-variable approximation. The resulting $F_t$ is in fact a one-step estimate of $\rho_d$, starting from $\hat\rho_d(s) \equiv 1$ (compare Equation (4) with the $F_t$ update in Algorithm 1), up to a minor difference: $F^{ETD}_t = \beta F^{COP\text{-}TD}_t + 1$, which, following our logic, adds bias to the estimate.² Unlike ETD(λ, β), the effectiveness of COP-TD(λ, β) depends on the available resources: the number of features $\phi_\rho(s)$ can be adjusted to provide the most affordable approximation. The added cost is fine-tuning another step size, though the effect of β is less prominent.

² We have conducted several experiments with an altered ETD and indeed obtained better results compared with the original; these experiments are outside the scope of this paper.
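To make the relation above concrete, the following minimal check (with synthetic ratios and tabular indicator features; nothing here is taken from the paper's experiments) verifies numerically that, when $\hat\rho_d \equiv 1$, the scalar follow-on trace of ETD(0, β) equals one plus β times the summed COP-TD trace, i.e., $F^{ETD}_t = \beta F^{COP\text{-}TD}_t + 1$ as stated above.

```python
import numpy as np

# Numeric check of F_t^ETD = 1 + beta * (1^T F_t^COP) using random IS ratios
# rho_t and random state indices. The "environment" is purely synthetic;
# only the two recursions themselves are taken from the text.
rng = np.random.default_rng(0)
n_states, beta, T = 5, 0.8, 200

F_cop = np.zeros(n_states)   # state-indexed trace of COP-TD(0, beta), F_0 = 0
F_etd = 1.0                  # scalar follow-on trace of ETD(0, beta), F_0 = 1
prev_s = rng.integers(n_states)

for t in range(1, T):
    rho_prev = rng.uniform(0.5, 1.5)            # stand-in for pi/mu at time t-1
    e_prev = np.eye(n_states)[prev_s]           # indicator vector of s_{t-1}
    F_cop = rho_prev * (beta * F_cop + e_prev)  # Algorithm 1, line 8
    F_etd = 1.0 + beta * rho_prev * F_etd       # ETD(0, beta) follow-on trace
    # With rho_d_hat == 1, the COP estimate (before normalization) is 1^T F_cop:
    assert np.isclose(F_etd, 1.0 + beta * F_cop.sum())
    prev_s = rng.integers(n_states)

print("F_t^ETD = 1 + beta * 1^T F_t^COP holds for all t")
```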
## 6. The Logarithm Approach for Handling Long Products

We now present a heuristic algorithm that works similarly to COP-TD(λ, β). Before presenting the algorithm, we explain the motivation behind it.

### 6.1. Statistical Interpretation of TD(λ)

Konidaris et al. (2011) suggested a statistical interpretation of TD(λ). They show that under several assumptions the TD(λ) estimate $R^\lambda_{s_t}$ is the maximum likelihood estimator of $V(s_t)$ given the $R^{(n)}_{s_t}$: (1) each $R^{(n)}_{s_t}$ is an unbiased estimator of $V(s_t)$; (2) the random variables $R^{(n)}_{s_t}$ are independent, and in particular uncorrelated; (3) the random variables $R^{(n)}_{s_t}$ are jointly normally distributed; and (4) the variance of each $R^{(n)}_{s_t}$ is proportional to $\lambda^{-n}$. Under Assumptions 1-3, the maximum likelihood estimator of V(s) given these estimates can be represented as a linear convex combination of the $R^{(n)}_{s_t}$ with weights:

$$w_n = \frac{\left[\mathrm{Var}\big(R^{(n)}_{s_t}\big)\right]^{-1}}{\sum_{m}\left[\mathrm{Var}\big(R^{(m)}_{s_t}\big)\right]^{-1}}.$$

Subsequently, Konidaris et al. (2011) relaxed Assumption 4 and instead proposed a closed-form approximation of the variance. In a follow-up paper by Thomas et al. (2015b), the second assumption was also removed and the weights were instead given as:

$$w_n = \frac{\mathbf{1}^\top\mathrm{cov}(R_{s_t})^{-1} e_n}{\mathbf{1}^\top\mathrm{cov}(R_{s_t})^{-1}\mathbf{1}},$$

where the covariance matrix can be estimated from the data, or otherwise learned through some parametric form. While both the approximated variance and the learned covariance matrix improve performance on several benchmarks, the first uses a rather crude approximation, and the second is both state-dependent and based on noisy estimates of the covariance matrix. In addition, there are no efficient on-line implementations, since all past weights would have to be recalculated to match each new sample. Still, the suggested statistical justification is a valuable tool in assessing the similar role of β in ETD(λ, β).

### 6.2. Variance Weighted Γⁿ_t

As was shown by Konidaris et al. (2011), we can use state-dependent weights instead of β exponents to obtain better estimates. The second moments are given explicitly as follows:³

$$\mathbb{E}\left[(\Gamma^n_t)^2 \mid s_t\right] = \frac{d_\mu^\top \tilde{P}^{\,n-1} e_{s_t}}{d_\mu(s_t)}, \qquad \text{where } [\tilde P]_{s,s'} = \sum_{a\in A}\frac{\pi^2(a|s)}{\mu(a|s)}P(s'|s,a).$$

These can be estimated for each state separately. Notice that the variances increase exponentially, at a rate governed by the largest eigenvalue of $\tilde P$ (as Assumption 4 dictates), but this is merely an asymptotic behavior and may be relevant only when the weights are already negligible. Hence, an on-line implementation with the varying weights should not be a problem, as generally only the first few of these weights are non-zero. While this solution is impractical in problems with large state spaces, parameterizing or approximating these variances (similarly to Thomas et al. (2015b)) could improve performance in specific applications.

³ The covariances can be expressed analytically as well; for clarity we drop this immediate result.

### 6.3. Log-COP-TD(λ, β)

Assumption 3 of the previous section is that the sampled estimators ($R^{(n)}$, $\Gamma^n_t$) are normally distributed. For on-policy TD(λ), this assumption might not seem too harsh, as the estimators $R^{(n)}$ are growing sums of random variables. However, in our case the estimators $\Gamma^n_t$ are growing products of random variables. To correct this issue we can define new estimators by applying a logarithm to each $\Gamma^n_t$:

$$\log\left[\rho_d(s_t)\right] = \log \mathbb{E}\!\left[\prod_{k=t-m}^{t-1}\rho_k\,\rho_d(s_{t-m}) \,\Big|\, s_t\right] \approx \log\left[\hat\rho_d(s_{t-m})\right] + \sum_{k=t-m}^{t-1}\mathbb{E}\left[\log\rho_k \mid s_t\right]. \tag{5}$$

This approximation is crude; we could add terms reducing the error through a Taylor expansion, but these would be complicated to deal with. Hence, we regard this method mainly as a well-motivated heuristic. Notice that this formulation resembles the standard MDP formulation, only with the corresponding reward terms log[ρ_t] going backward instead of forward, and no discount factor.
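As a side illustration of the normality concern (a synthetic sanity check, not an experiment from the paper): a product of even mildly noisy IS ratios is heavy-tailed and skewed, whereas the sum of their logarithms stays well behaved, which is what motivates moving to the log domain.

```python
import numpy as np

# Synthetic illustration: a product of n IS ratios is heavy-tailed and far
# from normal, while the sum of their logs concentrates. The ratios here are
# arbitrary lognormal stand-ins, not drawn from any particular MDP.
rng = np.random.default_rng(1)
n, trials = 30, 100_000
log_rho = rng.normal(loc=0.0, scale=0.3, size=(trials, n))  # log of each ratio

prod = np.exp(log_rho.sum(axis=1))   # Gamma^n_t-style product of ratios
logsum = log_rho.sum(axis=1)         # log-domain estimator: sum of log-ratios

print("product : mean %.2f, std %.2f, 99.9%% quantile %.1f"
      % (prod.mean(), prod.std(), np.quantile(prod, 0.999)))
print("log-sum : mean %.2f, std %.2f, 99.9%% quantile %.2f"
      % (logsum.mean(), logsum.std(), np.quantile(logsum, 0.999)))
```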
Unfortunately, without a discount factor we cannot expect the estimated value to converge, so we propose using an artificial one, $\gamma_{\log}$. We can incorporate function approximation in this formulation as well. Unlike COP-TD(λ, β), we can choose the features and weights as we wish with no restriction, besides the linear constraint on the resulting $\rho_d$ through the weight vector $\theta_\rho$. This constraint can be approximately enforced by normalizing $\theta_\rho$ using $\frac{1}{t}\sum_{t'}\exp\big(\theta_\rho^\top\phi_\rho(s_{t'})\big)$ (which should equal 1 if we were exactly correct). We call the resulting algorithm Log-COP-TD(λ, β).

Algorithm 3: Log-COP-TD(λ, β) with function approximation. Input: $\theta_0$, $\theta_{\rho,0}$.
1: Initialize: $F_0 = 0$, $n^\beta_0 = 1$, $N_\phi = 0$, $X = 0$, $e_0 = 0$
2: for t = 1, 2, ... do
3: Observe $s_t, a_t, r_t, s_{t+1}$
4: Update normalization terms:
5: $n^\beta_t = \beta n^\beta_{t-1} + 1$, $N_\phi = \gamma_{\log}(\beta N_\phi + \phi_\rho(s_t))$, $X = X + \exp(\theta_{\rho,t}^\top\phi_\rho(s_t))$
6: Update the weighted average of the $\log(\Gamma^n_t)$:
7: $F_t = \beta\gamma_{\log} F_{t-1} + n^\beta_t \log[\rho_{t-1}]$
8: Update and project by the TD error of $\log(\rho_d)$:
9: $\delta^d_t = F_t / n^\beta_t + \theta_{\rho,t}^\top N_\phi / n^\beta_t - \theta_{\rho,t}^\top\phi_\rho(s_t)$
10: $\theta_{\rho,t+1} = \theta_{\rho,t} + \alpha^d_t \delta^d_t \phi_\rho(s_t)$
11: Off-policy TD(λ):
12: $M_t = \lambda + (1-\lambda)\exp\big(\theta_{\rho,t+1}^\top\phi_\rho(s_t)\big)/(X/t)$
13: $e_t = \rho_t(\lambda\gamma e_{t-1} + M_t\phi(s_t))$
14: $\delta_t = r_t + \theta_t^\top(\gamma\phi(s_{t+1}) - \phi(s_t))$
15: $\theta_{t+1} = \theta_t + \alpha_t\delta_t e_t$
16: end for

### 6.4. Using the Original Features

An interesting phenomenon occurs when the behavior and target policies employ a feature-based Boltzmann distribution for choosing actions: $\mu(a|s) = \exp(\theta_{a,\mu}^\top\phi(s))$ and $\pi(a|s) = \exp(\theta_{a,\pi}^\top\phi(s))$, where a constant feature is added to absorb the (possibly different) normalizing constants. Then $\log(\rho_t) = (\theta_{a_t,\pi} - \theta_{a_t,\mu})^\top\phi(s_t)$, and Log-COP-TD(λ, β) obtains a parametric form that depends on the original features instead of a different set.

### 6.5. Approximation Hardness

As we propose to use linear function approximation for $\rho_d(s)$ and $\log(\rho_d(s))$, one cannot help but wonder how hard it is to approximate these quantities, especially compared to the value function. The comparison between V(s) and $\rho_d(s)$ is problematic for several reasons:

1. The ultimate goal is estimating $V^\pi(s)$; approximation errors in $\rho_d(s)$ are second-order terms.
2. The value function $V^\pi(s)$ depends on the policy-induced reward function and transition probability matrix, while $\rho_d(s)$ depends on the stationary distributions induced by both policies. Since each depends on at least one distinct factor, we can expect different setups to result in varied approximation hardness. For example, if the reward function has a poor approximation then so will $V^\pi(s)$, while extremely different behavior and target policies can cause $\rho_d(s)$ to behave erratically.
3. Subsequently, the choice of features for approximating $V^\pi(s)$ and $\rho_d(s)$ can differ significantly depending on the problem at hand.

If we would still like to compare $V^\pi(s)$ and $\rho_d(s)$, we can think of extreme examples: when π = µ, $\rho_d(s) \equiv 1$; when R(s) ≡ 0, $V^\pi(s) \equiv 0$. In the chain MDP example of Section 4 we saw that $\rho_d(s)$ is an exponential function of the location in the chain; setting the reward at one end to 1 results in an exponential form for $V^\pi(s)$ as well. Consequently, in the chain MDP example approximating $\log(\rho_d(s))$ is easier than approximating $\rho_d(s)$, as we obtain a linear function of the position; this is not the general case.
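A small numeric sketch of the observation in Section 6.4, using hypothetical softmax-linear policies (the weights below are arbitrary): the log-ratio $\log\rho_t$ decomposes into a part that is linear in φ(s) plus a per-state constant (the log-ratio of the normalizers), which is exactly what the added constant feature absorbs.

```python
import numpy as np

# Sketch of the Section 6.4 observation with hypothetical softmax-linear
# policies: log(pi(a|s)/mu(a|s)) equals a term linear in phi(s) plus a
# state-dependent (action-independent) log partition ratio.
rng = np.random.default_rng(2)
k, n_actions = 4, 3
phi_s = rng.normal(size=k)                   # feature vector of some state s
theta_mu = rng.normal(size=(n_actions, k))   # behavior policy weights (made up)
theta_pi = rng.normal(size=(n_actions, k))   # target policy weights (made up)

def softmax_probs(theta, phi):
    logits = theta @ phi
    z = np.exp(logits - logits.max())
    return z / z.sum()

mu = softmax_probs(theta_mu, phi_s)
pi = softmax_probs(theta_pi, phi_s)

for a in range(n_actions):
    # log rho = (theta_{a,pi} - theta_{a,mu})^T phi(s) + log(Z_mu(s)/Z_pi(s));
    # the last term is the same for every action a.
    lhs = np.log(pi[a] / mu[a])
    rhs = (theta_pi[a] - theta_mu[a]) @ phi_s
    print(f"a={a}: log rho = {lhs:+.3f}, linear part = {rhs:+.3f}, "
          f"gap (log Z ratio) = {lhs - rhs:+.3f}")
```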
## 7. Experiments

We performed three types of experiments. Our first batch of experiments (Figure 1) demonstrates the accuracy of the $\rho_d$ predictions of both COP-TD(λ, β) and Log-COP-TD(λ, β). We show two types of setups in which visualization of $\rho_d$ is relatively clear: the chain MDP example of Section 4, and the mountain car domain (Sutton and Barto, 1998), in which the state is determined by only two continuous variables, the car's position and speed. The parameters λ and β exhibited low sensitivity in these tasks, so they were simply set to 0; we show the estimated $\rho_d$ after 10⁶ iterations. For the chain MDP (top two plots; notice the logarithmic scale) we first approximate $\rho_d$ without any function approximation (top-left), and we can see that COP-TD manages to converge to the correct value while Log-COP-TD is much less exact. When we use a linear feature space (constant parameter and position), Log-COP-TD captures the true behavior of $\rho_d$ much better, as expected. The two lower plots show the error (in color) of the $\rho_d$ estimate for mountain car with a pure-exploration behavior policy versus a target policy oriented toward moving right. The z-axis is the same for both plots and describes a much more accurate estimate of $\rho_d$ obtained through simulations. The features used were local state aggregation. We can see that both algorithms succeed similarly on the position-speed pairs that are sampled often due to the behavior policy and the mountain. When looking at more rarely observed states, the estimate becomes worse for both algorithms, though Log-COP-TD seems to perform better on the spike at position > 0.

[Figure 1: Estimation quality of COP-TD and Log-COP-TD in the chain MDP (top) and mountain car (bottom) problems. The chain MDP plots differ by the function approximation, and the shading reflects one standard deviation over 10 trajectories. The mountain car plots compare COP-TD with Log-COP-TD, where the z-axis is the same (true ρ_d) and the colors specify the error. Panels: "Chain MDP 100 states, no func. approx.", "Chain MDP 100 states, linear func. approx.", "Mountain car ρ_d estimation, COP", "Mountain car ρ_d estimation, Log-COP".]

Next we test the sensitivity of COP-TD(λ, β) and Log-COP-TD(λ, β) to the parameters β and $\gamma_{\log}$ (Figure 2) on two distinct toy examples: the chain MDP introduced before, but with only 30 states and the position-linear features, and a random MDP with 32 states, 2 actions, and a 5-bit binary feature vector along with a free parameter (this compact representation was suggested by White and White (2016) to approximate real-world problems). The policies on the chain MDP were taken as described before; on the random MDP the behavior/target policy chooses an action with a state-independent probability of 0.75/0.25. As we can see, larger values of β cause noisier estimates in the random MDP for COP-TD(λ, β), but have little effect in the other settings. As for $\gamma_{\log}$: if it is too large or too small, the error behaves sub-optimally, as expected for the crude approximation of Equation (5). In conclusion, unlike ETD(λ, β), COP-TD(λ, β) and Log-COP-TD(λ, β) are much less affected by β, though $\gamma_{\log}$ should be tuned to improve results.

Our final experiment (Figure 3) compares our algorithms to ETD(λ, β) and GTD(λ) over four setups: a chain MDP with 100 states, rewards of 1 on the right half and linear features; a 2-action random MDP with 256 states and binary features; acrobot (3 actions); and cart-pole balancing (21 actions) (Sutton and Barto, 1998), with reset at success and state aggregation to 100 states.
In all problems we used the same features for estimating $\rho_d$ and $V^\pi(s)$, γ = 0.99, a constant step size of 0.05 for the TD process, and results averaged over 10 trajectories; the other parameters (λ, β, the other step sizes, $\gamma_{\log}$) were swept over to find the best ones. To reduce figure clutter we did not include standard deviations, though the noisy averages still reflect the variance in the process.

[Figure 2: The effect of β and γ_log on COP-TD(λ, β) and Log-COP-TD(λ, β); the y-axis is the sum of squared errors (SSE) of the ρ_d estimate over all states. Panels: COP-TD β sweep and Log-COP-TD β sweep (β ∈ {0, 0.25, 0.5, 0.75, 0.99}), and Log-COP-TD γ_log sweep (γ_log ∈ {0.99, 0.9999}), on the 30-state chain MDP and the 32-state random MDP.]

Our method of comparison on the first two setups estimates the value function using the suggested algorithm and computes the $d_\pi$-weighted average of the error between $\hat V$ and the on-policy fixed point $\Pi_{d_\pi}T_\pi V_\pi$:

$$\left\|\hat V - \Pi_{d_\pi}T_\pi V_\pi\right\|^2_{d_\pi} = \sum_{s} d_\pi(s)\left[(\theta^* - \hat\theta)^\top\phi(s)\right]^2,$$

where $\theta^*$ is the optimal weight vector obtained by on-policy TD using the target policy. On the latter two (continuous-state) problems we applied on-line TD on a different trajectory following the target policy, used the resulting θ value as ground truth, and took the sum of squared errors with respect to it. The behavior and target policies for the chain MDP and random MDP are as specified before. For the acrobot problem the behavior policy is uniform over the 3 actions, and the target policy chooses among them with fixed non-uniform probabilities. For cart-pole the action space is divided into 21 actions from -1 to 1, equally spaced; the behavior policy chooses among these uniformly, while the target policy is 1.5 times more prone to choosing a positive action than a negative one.

The experiments show that COP-TD(λ, β) and Log-COP-TD(λ, β) have performance comparable to ETD(λ, β), with at least one of them better in every setup. The advantage of the new algorithms is especially visible in the chain MDP, which corresponds to a large discrepancy between the stationary distributions of the behavior and target policies. GTD(λ) is consistently worse on the tested setups; this might be due to the large difference between the chosen behavior and target policies, which affects GTD(λ) the most.

[Figure 3: Error over time of several on-line off-policy algorithms (ETD, GTD, COP-TD, Log-COP-TD) on the four setups.]

## 8. Conclusion

Research on off-policy evaluation has flourished in the last decade. While a plethora of algorithms has been suggested, ETD(λ, β) by Hallak et al. (2015) has perhaps the simplest formulation and theoretical properties. Unfortunately, ETD(λ, β) does not converge to the same point achieved by on-line TD when linear function approximation is applied. We address this issue with COP-TD(λ, β) and prove it can achieve consistency when used with a correct set of features, or at least allow trading off some of the bias by adding or removing features. Despite requiring a new set of features and calibrating an additional update, the performance of COP-TD(λ, β) does not depend on β as much as ETD(λ, β)'s does, and it shows promising empirical results. We offer a connection to the statistical interpretation of TD(λ) that motivates our entire formulation.
This interpretation leads to two additional approaches: (a) weighting the $\Gamma^n_t$ using estimated variances instead of β exponents, and (b) approximating $\log[\rho_d]$ instead of $\rho_d$; both approaches deserve consideration when facing a real application.

## 9. Acknowledgments

This research was supported in part by the Israel Science Foundation (grant No. 920/12) and by the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 306638.

## References

Hasan AA Al-Rawi, Ming Ann Ng, and Kok-Lim Alvin Yau. Application of reinforcement learning to routing in distributed wireless networks: a review. Artificial Intelligence Review, 43(3):381-416, 2015.

Enda Barrett, Enda Howley, and Jim Duggan. Applying reinforcement learning towards automating resource allocation and application scalability in the cloud. Concurrency and Computation: Practice and Experience, 25(12):1656-1674, 2013.

D. Bertsekas. Dynamic Programming and Optimal Control, Vol II. Athena Scientific, 4th edition, 2012.

D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

D. Bertsekas and H. Yu. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227(1):27-50, 2009.

Shalabh Bhatnagar, Vivek S Borkar, and LA Prashanth. Adaptive feature pursuit: Online adaptation of features in reinforcement learning. Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, pages 517-534, 2012.

Shalabh Bhatnagar, Vivek S Borkar, and KJ Prabuchandran. Feature search in the Grassmannian in online reinforcement learning. IEEE Journal of Selected Topics in Signal Processing, 7(5):746-758, 2013.

Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, and Klaus Obermayer. Construction of approximation spaces for reinforcement learning. Journal of Machine Learning Research, 14(1):2067-2118, 2013.

Justin A Boyan. Least-squares temporal difference learning. In ICML, pages 49-56, 1999.

Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.

Yunmei Chen and Xiaojing Ye. Projection onto a simplex. arXiv preprint arXiv:1101.6081, 2011.

Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809-883, 2014.

Dotan Di Castro and Shie Mannor. Adaptive bases for reinforcement learning. Machine Learning and Knowledge Discovery in Databases, pages 312-327, 2010.

Amir M Farahmand, Mohammad Ghavamzadeh, Shie Mannor, and Csaba Szepesvári. Regularized policy iteration. In Advances in Neural Information Processing Systems, pages 441-448, 2009.

Clement Gehring, Yangchen Pan, and Martha White. Incremental truncated LSTD. arXiv preprint arXiv:1511.08495, 2015.

Matthieu Geist and Bruno Scherrer. l1-penalized projected Bellman residual. In European Workshop on Reinforcement Learning, pages 89-101. Springer, 2011.

Matthieu Geist and Bruno Scherrer. Off-policy learning with eligibility traces: A survey. The Journal of Machine Learning Research, 15(1):289-333, 2014.

Matthieu Geist, Bruno Scherrer, Alessandro Lazaric, and Mohammad Ghavamzadeh. A Dantzig selector approach to temporal difference learning. arXiv preprint arXiv:1206.6480, 2012.

Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, and Rémi Munos. LSTD with random projections. In Advances in Neural Information Processing Systems, pages 721-729, 2010.
Sertan Girgin and Philippe Preux. Basis expansion in natural actor critic methods. In European Workshop on Reinforcement Learning, pages 110-123. Springer, 2008.

Arash Givchi and Maziar Palhang. Off-policy temporal difference learning with distribution adaptation in fast mixing chains. Soft Computing, pages 1-14, 2017.

Hirotaka Hachiya and Masashi Sugiyama. Feature selection for reinforcement learning: Evaluating implicit state-reward dependency via conditional mutual information. Machine Learning and Knowledge Discovery in Databases, pages 474-489, 2010.

Hirotaka Hachiya, Masashi Sugiyama, and Naonori Ueda. Importance-weighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition. Neurocomputing, 80:93-101, 2012.

Assaf Hallak, Aviv Tamar, Remi Munos, and Shie Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. arXiv preprint arXiv:1509.05172, 2015.

Irit Hochberg, Guy Feraru, Mark Kozdoba, Shie Mannor, Moshe Tennenholtz, and Elad Yom-Tov. Encouraging physical activity in patients with diabetes through automatic personalized feedback via reinforcement learning improves glycemic control. Diabetes Care, 39(4):e59-e60, 2016.

Matthew W Hoffman, Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Regularized least squares temporal difference learning with nested l2 and l1 penalization. In European Workshop on Reinforcement Learning, pages 102-114. Springer, 2011.

Jeff Johns and Sridhar Mahadevan. Constructing basis functions from directed graphs for value function approximation. In Proceedings of the 24th International Conference on Machine Learning, pages 385-392. ACM, 2007.

Jeffrey Johns, Christopher Painter-Wakefield, and Ronald Parr. Linear complementarity for regularized policy evaluation and improvement. In Advances in Neural Information Processing Systems, pages 1009-1017, 2010.

Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391-1445, 2009.

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, page 0278364913495721, 2013.

J Zico Kolter. The fixed points of off-policy TD. In NIPS, 2011.

J Zico Kolter and Andrew Y Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 521-528. ACM, 2009.

George Konidaris, Scott Niekum, and Philip S Thomas. TDγ: Re-evaluating complex backups in temporal difference learning. In Advances in Neural Information Processing Systems 24, pages 2402-2410. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4472-td_gamma-re-evaluating-complex-backups-in-temporal-difference-learning.pdf.

Mark Kroon and Shimon Whiteson. Automatic feature selection for model-based reinforcement learning in factored MDPs. In Machine Learning and Applications, 2009 (ICMLA '09), International Conference on, pages 324-330. IEEE, 2009.

Harold Kushner and G George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.

Bo Liu, Sridhar Mahadevan, and Ji Liu. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems, pages 836-844, 2012.
De-Rong Liu, Hong-Liang Li, and Ding Wang. Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey. International Journal of Automation and Computing, 12(3):229-242, 2015.

Manuel Loth, Manuel Davy, and Philippe Preux. Sparse temporal difference learning using LASSO. In Approximate Dynamic Programming and Reinforcement Learning, 2007 (ADPRL 2007), IEEE International Symposium on, pages 352-359. IEEE, 2007.

Sridhar Mahadevan. Samuel meets Amarel: Automating value function approximation using global state space analysis. In AAAI, volume 5, pages 1000-1005, 2005.

Sridhar Mahadevan and Bo Liu. Sparse Q-learning with mirror descent. arXiv preprint arXiv:1210.4893, 2012.

Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(Oct):2169-2231, 2007.

Sridhar Mahadevan et al. Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends in Machine Learning, 1(4):403-565, 2009.

A Rupam Mahmood and Richard S Sutton. Off-policy learning based on weighted importance sampling with linear computational complexity. In Conference on Uncertainty in Artificial Intelligence, 2015.

A Rupam Mahmood, Hado P van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 3014-3022, 2014.

Ishai Menache, Shie Mannor, and Nahum Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1):215-238, 2005.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Christopher Painter-Wakefield and Ronald Parr. Greedy algorithms for sparse reinforcement learning. arXiv preprint arXiv:1206.6485, 2012.

Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael Littman. Analyzing feature generation for value-function approximation. In Proceedings of the 24th International Conference on Machine Learning, pages 737-744. ACM, 2007.

Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752-759. ACM, 2008.

Marek Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In IJCAI, pages 2574-2579, 2007.

Marek Petrik, Gavin Taylor, Ron Parr, and Shlomo Zilberstein. Feature selection using regularization in approximate linear programs for Markov decision processes. arXiv preprint arXiv:1005.1860, 2010.

Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, 2001.

Zhiwei Qin, Weichang Li, and Firdaus Janoos. Sparse reinforcement learning via convex optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 424-432, 2014.

Zeev Schuss and Vivek S Borkar. Stochastic approximation: A dynamical systems viewpoint, 2009.

William D Smart. Explicit manifold representations for value-function approximation in reinforcement learning. In ISAIM, 2004.
Yi Sun, Mark Ring, Jürgen Schmidhuber, and Faustino J Gomez. Incremental basis construction from temporal difference error. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 481-488, 2011.

R. S. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv preprint arXiv:1503.04269, 2015.

Rich Sutton, Ashique R Mahmood, Doina Precup, and Hado van Hasselt. A new Q(λ) with interim forward view and Monte Carlo equivalence. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 568-576, 2014.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609-1616, 2009a.

Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993-1000. ACM, 2009b.

Georgios Theocharous and Assaf Hallak. Lifetime value marketing using reinforcement learning. RLDM 2013, page 19, 2013.

Georgios Theocharous, Philip S Thomas, and Mohammad Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-15), 2015.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2380-2388, 2015a.

Philip S Thomas, Scott Niekum, Georgios Theocharous, and George Konidaris. Policy evaluation using the Ω-return. In Advances in Neural Information Processing Systems 28, pages 334-342. Curran Associates, Inc., 2015b. URL http://papers.nips.cc/paper/5807-policy-evaluation-using-the-return.pdf.

John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

Hado van Hasselt, A Rupam Mahmood, and Richard S Sutton. Off-policy TD(λ) with a true online equivalence. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, Quebec City, Canada, 2014.

Jian Wang, Zhenhua Huang, and Xin Xu. A novel approach for constructing basis functions in approximate dynamic programming for feedback control. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013 IEEE Symposium on, pages 47-51. IEEE, 2013.

Adam White and Martha White. Investigating practical, linear temporal difference learning. arXiv preprint arXiv:1602.08771, 2016.

Dean S Wookey and George D Konidaris. Regularized feature selection in reinforcement learning. Machine Learning, 100(2-3):655-676, 2015.

Dean Stephen Wookey. Representation Discovery using a Fixed Basis in Reinforcement Learning. PhD thesis, University of the Witwatersrand, South Africa, 2016.

H. Yu. On convergence of emphatic temporal-difference learning. In COLT, 2015.
Huizhen Yu. Convergence of least squares temporal difference methods under general conditions. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1207-1214, 2010.

Tom Zahavy, Nir Ben-Zrihem, and Shie Mannor. Graying the black box: Understanding DQNs. arXiv preprint arXiv:1602.02658, 2016.