# Average-Reward Off-Policy Policy Evaluation with Function Approximation

Shangtong Zhang¹*, Yi Wan²*, Richard S. Sutton², Shimon Whiteson¹

Abstract

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

*Equal contribution. ¹University of Oxford, ²University of Alberta. Correspondence to: Shangtong Zhang <shangtong.zhang@cs.ox.ac.uk>, Yi Wan. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

A fundamental problem in average-reward Markov Decision Processes (MDPs, see, e.g., Puterman (1994)) is policy evaluation, that is, estimating, for a given policy, the reward rate and the differential value function. The reward rate of a policy is the average reward per step and thus measures the policy's long-term performance. The differential value function summarizes the expected cumulative future excess rewards, which are the differences between received rewards and the reward rate. The solution of the policy evaluation problem is interesting in itself because it provides a useful performance metric, the reward rate, for a given policy. In addition, it is an essential part of many control algorithms, which aim to generate a policy that maximizes the reward rate by iteratively improving the policy using its estimated differential value function (see, e.g., Howard (1960); Konda (2002); Abbasi-Yadkori et al. (2019)).

One typical approach in policy evaluation is to learn from real experience directly, without knowing or learning a model. If the policy followed to generate experience (the behavior policy) is the same as the policy of interest (the target policy), then this approach yields an on-policy method; otherwise, it is off-policy. Off-policy methods are usually more practical in settings in which following bad policies incurs prohibitively high cost (Dulac-Arnold et al., 2019). For policy evaluation, we can use either tabular methods, which maintain a look-up table to store quantities of interest (e.g., the differential values for all states) separately, or function approximation, which represents these quantities collectively, possibly in a more efficient way (e.g., using a neural network). Function approximation methods are necessary for MDPs with large state and/or action spaces because they are scalable in the size of these spaces and also generalize to states and actions that are not in the data (Mnih et al., 2015; Silver et al., 2016).
Finally, for the policy evaluation problem in average-reward MDPs, the agent's stream of experience never terminates and thus actual returns cannot be obtained. Because of this, learning algorithms have to bootstrap, that is, the estimated values must be updated towards targets that include existing estimated values instead of actual returns. In this paper, we consider methods for solving the average-reward policy evaluation problem with all the above three elements (off-policy learning, function approximation, and bootstrapping), which comprise the deadly triad (see Chapter 11 of Sutton & Barto (2018) and Section 3). The main contributions of this paper are two newly proposed methods to break this deadly triad in the average-reward setting, both of which are inspired by the celebrated success of the Gradient TD family of algorithms (Sutton et al., 2009b;a) in breaking the deadly triad in the discounted setting.

Few methods exist for learning differential value functions. These are either on-policy linear function approximation methods (Tsitsiklis & Van Roy, 1999; Konda, 2002; Yu & Bertsekas, 2009; Abbasi-Yadkori et al., 2019) or off-policy tabular methods (Wan et al., 2020). The on-policy methods use the empirical average of received rewards as an estimate for the reward rate. Thus they are not straightforward to extend to the off-policy case. And, as we show later with a counterexample, the naive extension of the off-policy tabular method by Wan et al. (2020) to the linear function approximation setting can diverge, exemplifying the deadly triad. By contrast, the two algorithms we propose are the first provably convergent methods for learning the differential value function via off-policy linear function approximation.

All existing methods for estimating the reward rate in the off-policy function approximation setting require learning the density ratio, i.e., the ratio between the stationary distribution of the target policy and the sampling distribution (Liu et al., 2018a; Zhang et al., 2020a;b; Mousavi et al., 2020; Lazic et al., 2020). Interestingly, while density-ratio-based methods dominate off-policy policy evaluation with function approximation in average-reward MDPs, in discounted MDPs both density-ratio-based (Hallak & Mannor, 2017; Liu et al., 2018a; Gelada & Bellemare, 2019; Nachum et al., 2019a; Uehara & Jiang, 2019; Xie et al., 2019; Tang et al., 2019; Zhang et al., 2020a;b) and value-based (Baird, 1995; Sutton et al., 2009b;a; 2016; Thomas et al., 2015; Jiang & Li, 2015) methods have succeeded. It thus remains unknown whether a convergent value-based method could be found for such a problem and, if it exists, how it performs compared with density-ratio-based methods. The two algorithms we propose are the first provably convergent differential-value-based methods for reward rate estimation via off-policy linear function approximation, which answers the question affirmatively. Furthermore, our empirical study shows that our value-based methods consistently outperform a competitive density-ratio-based approach, GradientDICE (Zhang et al., 2020b), in the tested domains, including both a simple Markov chain and challenging robot simulation tasks.

2. Background

In this paper, we use $\|\cdot\|_M$ to denote the vector norm induced by a positive definite matrix $M$, i.e., $\|x\|_M = \sqrt{x^\top M x}$. We also use $\|\cdot\|_M$ to denote the corresponding induced matrix norm. When $M = I$, we drop the subscript and write $\|\cdot\|$ for simplicity.
All vectors are column vectors. $\mathbf{0}$ denotes an all-zero vector whose dimension can be deduced from the context; $\mathbf{1}$ is similarly defined. When it does not cause confusion, we use a function and a vector interchangeably. For example, if $f$ is a function from $\mathcal{X}$ to $\mathbb{R}$, we also use $f$ to denote the corresponding vector in $\mathbb{R}^{|\mathcal{X}|}$.

We consider an infinite-horizon MDP with a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and a transition kernel $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$. When an agent follows a policy $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$ in the MDP, at time step $t$ the agent observes a state $S_t$, takes an action $A_t \sim \pi(\cdot|S_t)$, receives a reward $r(S_t, A_t)$, proceeds to the next time step, and observes the next state $S_{t+1} \sim p(\cdot|S_t, A_t)$. The reward rate of policy $\pi$ is defined as

$$r_\pi \doteq \text{C-}\lim_{t \to \infty} \mathbb{E}[r(S_t, A_t) \mid \pi, S_0], \qquad (1)$$

where $\text{C-}\lim_{T \to \infty} z_T \doteq \lim_{T \to \infty} \frac{1}{T+1}\sum_{i=0}^{T} z_i$ is the Cesaro limit. The Cesaro limit in (1) is assumed to exist and to be independent of $S_0$. The most general assumption that guarantees these is the following one:

Assumption 2.1. Policy $\pi$ induces a unichain.

The action-value function in the average-reward setting is known as the differential action-value function and is defined as $q_\pi(s, a) \doteq \text{C-}\lim_{T \to \infty} \sum_{t=0}^{T} \mathbb{E}[r(S_t, A_t) - r_\pi \mid S_0 = s, A_0 = a]$. Note that if a stronger ergodic-chain assumption is used instead, the Cesaro limits defining $r_\pi$ and $q_\pi$ are equivalent to the ordinary limits. The action-value Bellman equation is

$$q = r - \bar{r}\mathbf{1} + P_\pi q, \qquad (2)$$

where $q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ and $\bar{r} \in \mathbb{R}$ are free variables and $P_\pi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$ is the transition matrix, that is, $P_\pi((s, a), (s', a')) \doteq p(s'|s, a)\pi(a'|s')$. It is well known (Puterman, 1994) that $\bar{r} = r_\pi$ is the unique solution for $\bar{r}$ and all the solutions for $q$ form the set $\{q_\pi + c\mathbf{1} : c \in \mathbb{R}\}$.

In this paper, we consider a special off-policy learning setting, where the agent learns from i.i.d. samples drawn from a given sampling distribution. In particular, at the $k$-th iteration, the agent draws a sample $(S_k, A_k, R_k, S'_k, A'_k)$ from a given sampling distribution $d_{\mu\pi}$, which can be any distribution satisfying the following assumption:

Assumption 2.2. $R_k = r(S_k, A_k)$, $S'_k \sim p(\cdot|S_k, A_k)$, $A'_k \sim \pi(\cdot|S'_k)$, and $d_\mu(s, a) > 0$ for all $(s, a)$, where $d_\mu(s, a)$ denotes the marginal distribution of $(S_k, A_k)$.

The last part of Assumption 2.2 means that every state-action pair has a positive probability of being sampled. This is a necessary condition for learning the differential value function accurately for all state-action pairs. In the rest of the paper, the expectation $\mathbb{E}$ is taken w.r.t. $d_{\mu\pi}$. If no sampling distribution is given, one could instead draw samples in the following way. First, randomly sample $(S_k, A_k, R_k, S'_k)$ from a batch of transitions collected by one or multiple agents, with all agents following possibly different unknown policies in the same MDP. Then sample $A'_k \sim \pi(\cdot|S'_k)$. Assuming that the number of occurrences of every state-action pair in the batch grows to infinity as the batch size grows to infinity, sampling from the batch is approximately equivalent to sampling from some distribution satisfying Assumption 2.2.

Our goal is to approximate, using the data generated from $d_{\mu\pi}$, both the reward rate and the differential value function. The reward rate $r_\pi$ is approximated by a learnable scalar $\hat{r}$. The differential value function $q_\pi$ is only approximated up to a constant. That is, we are only interested in approximating $q_\pi + c\mathbf{1}$ for some $c \in \mathbb{R}$. This is sufficient if the approximated value function is only used for policy improvement in a control algorithm.
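To make these quantities concrete, the following minimal NumPy sketch (ours, not from the paper) computes $r_\pi$ and one member of the solution set $\{q_\pi + c\mathbf{1}\}$ from a known model by solving the Bellman equation (2), pinning down the free constant with the extra constraint $d_\pi^\top q = 0$. The small transition matrix below is a hypothetical toy example.

```python
import numpy as np

# Hypothetical toy model over two state-action pairs; any row-stochastic P_pi
# over state-action pairs and any reward vector r work the same way.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])          # P_pi[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s')
r = np.array([1.0, 0.0])               # r[(s,a)]
n = len(r)

# Stationary distribution d_pi: d_pi^T P_pi = d_pi^T, d_pi^T 1 = 1.
A_ = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
b_ = np.append(np.zeros(n), 1.0)
d_pi = np.linalg.lstsq(A_, b_, rcond=None)[0]

# Reward rate (1): r_pi = d_pi^T r.
r_pi = d_pi @ r

# One solution of the Bellman equation (2): q = r - r_pi * 1 + P_pi q,
# pinned down by the extra constraint d_pi^T q = 0.
M = np.vstack([np.eye(n) - P_pi, d_pi[None, :]])
v = np.append(r - r_pi, 0.0)
q = np.linalg.lstsq(M, v, rcond=None)[0]

print(r_pi, q)                         # any q + c*1 also solves (2)
```

Such an exact solution requires the model and a small state-action space; the learning algorithms discussed below only have access to samples.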
However, when the state and/or action spaces are large, function approximation may be necessary to represent $q_\pi + c\mathbf{1}$. This paper mainly considers linear function approximation, where the agent is given a feature mapping $x$ that generates a $K$-dimensional vector $x(s, a)$ given a state-action pair $(s, a)$. The agent further maintains a learnable weight vector $w \in \mathbb{R}^K$ and adjusts it to approximate, for all $(s, a)$, $q_\pi(s, a) + c$ using $x(s, a)^\top w$. Let $X \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times K}$ be the feature matrix whose $(s, a)$ row is $x(s, a)^\top$. For the uniqueness of the solution for $w$, it is common to make the following assumption:

Assumption 2.3. $X$ has linearly independent columns.

3. Differential Semi-Gradient Q Evaluation

We first present Differential Semi-gradient Q Evaluation (Diff-SGQ), which is a straightforward extension of the tabular off-policy Differential TD-learning algorithm (Wan et al., 2020) to linear function approximation. At the $k$-th iteration, the algorithm draws a sample $(S_k, A_k, R_k, S'_k, A'_k)$ from $d_{\mu\pi}$ and updates $w_k$ and $\hat{r}_k$ as

$$w_{k+1} \doteq w_k + \alpha_k \delta_k(w_k, \hat{r}_k) x_k, \qquad (3)$$
$$\hat{r}_{k+1} \doteq \hat{r}_k + \alpha_k \delta_k(w_k, \hat{r}_k), \qquad (4)$$

where $\alpha_k$ is the stepsize used at the $k$-th iteration, $x_k \doteq x(S_k, A_k)$, $x'_k \doteq x(S'_k, A'_k)$, and $\delta_k(w, \hat{r}) \doteq R_k - \hat{r} + x_k'^\top w - x_k^\top w$ is the temporal difference error. From (2), it is easy to see that $r_\pi = d^\top (r + P_\pi q_\pi - q_\pi)$ holds for any probability distribution $d$; in particular, it holds for $d = d_\mu$, which is the intuition behind the $\hat{r}$ update (4). Diff-SGQ iteratively solves

$$\mathbb{E}[\delta_k(w, \hat{r}) x_k] = \mathbf{0} \quad \text{and} \quad \mathbb{E}[\delta_k(w, \hat{r})] = 0, \qquad (5)$$

whose solutions, if they exist, are TD fixed points. A TD fixed point is an approximate solution to (2) using linear function approximation. We consider the quality of the approximation in the next section. All the proposed algorithms in this paper aim to find a TD fixed point, up to some regularization bias if necessary. In general, there could be no TD fixed point, one TD fixed point, or infinitely many TD fixed points, as in the discounted setting. To see this, let $y_k \doteq [1, x_k^\top]^\top$, $y'_k \doteq [1, x_k'^\top]^\top$, $u \doteq [\hat{r}, w^\top]^\top$, and $e_1 \doteq [1, 0, \dots, 0]^\top \in \mathbb{R}^{K+1}$. Then combining (3) and (4) gives

$$\mathbb{E}[\delta_k(u) y_k] = \mathbf{0}, \qquad (6)$$

where $\delta_k(u) \doteq R_k - e_1^\top u + y_k'^\top u - y_k^\top u$. Writing (6) in vector form, we have $Au + b = \mathbf{0}$, where

$$A \doteq \mathbb{E}[y_k(-e_1 + y'_k - y_k)^\top] = Y^\top D (P_\pi - I) Y - Y^\top d_\mu e_1^\top = \begin{bmatrix} -1 & \mathbf{1}^\top D (P_\pi - I) X \\ -X^\top d_\mu & X^\top D (P_\pi - I) X \end{bmatrix},$$
$$b \doteq \mathbb{E}[y_k R_k] = Y^\top D r, \qquad Y \doteq [\mathbf{1}, X], \qquad D \doteq \operatorname{diag}(d_\mu).$$

If and only if $A$ is invertible, there exists a unique TD fixed point

$$u^*_{\text{TD}} \doteq -A^{-1} b. \qquad (7)$$

Otherwise, there is either no TD fixed point or there are infinitely many. Unfortunately, even if there exists a unique TD fixed point, Diff-SGQ can still diverge, which exemplifies the deadly triad (Sutton & Barto, 2018) in the average-reward setting. The following example confirms this point.

Example 1 (The divergence of Diff-SGQ). Consider a two-state MDP (Figure 1). The expected Diff-SGQ update per step can be written as

$$\begin{bmatrix} \hat{r}_{k+1} \\ w_{k+1} \end{bmatrix} = \begin{bmatrix} \hat{r}_k \\ w_k \end{bmatrix} + \alpha \left( A \begin{bmatrix} \hat{r}_k \\ w_k \end{bmatrix} + b \right), \qquad A = \begin{bmatrix} -1 & 6 \\ -2 & 6 \end{bmatrix},$$

where we consider a constant stepsize $\alpha$. The eigenvalues of $A$ are both positive. Hence, no matter what positive stepsize is picked, the expected update diverges. The sample updates (3) and (4) using standard stochastic approximation stepsizes, therefore, also diverge. Furthermore, because both eigenvalues are positive, $A$ is an invertible matrix, implying the unique existence of the TD fixed point.

Figure 1. An example showing the divergence of Diff-SGQ.
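For reference, a minimal NumPy sketch of one Diff-SGQ iteration, i.e., the sample updates (3) and (4); the function and variable names are ours, and the transition-sampling routine mentioned in the comment is a placeholder, not part of the paper.

```python
import numpy as np

def diff_sgq_step(w, r_hat, alpha, x_sa, reward, x_sa_next):
    """One Diff-SGQ update, eqs. (3)-(4): semi-gradient TD with a learned reward-rate estimate."""
    # Temporal-difference error: delta = R - r_hat + x'^T w - x^T w.
    delta = reward - r_hat + x_sa_next @ w - x_sa @ w
    w_new = w + alpha * delta * x_sa          # eq. (3)
    r_hat_new = r_hat + alpha * delta         # eq. (4)
    return w_new, r_hat_new

# Usage sketch (sample_transition() is a hypothetical routine returning
# (x(S_k, A_k), R_k, x(S'_k, A'_k)) drawn from the sampling distribution d_{mu pi}):
# w, r_hat = np.zeros(K), 0.0
# for k in range(num_steps):
#     x_sa, reward, x_sa_next = sample_transition()
#     w, r_hat = diff_sgq_step(w, r_hat, alpha_k, x_sa, reward, x_sa_next)
```

As Example 1 shows, these iterates need not converge even when a unique TD fixed point exists, which motivates the gradient-based algorithms in the next sections.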
4. One-Stage Differential Gradient Q Evaluation

We now present an algorithm that is guaranteed to converge to the TD fixed point (6) if it uniquely exists. Motivated by the Mean Squared Projected Bellman Error (MSPBE) defined in the discounted setting and used by Gradient TD algorithms, we define the MSPBE in the average-reward setting as

$$\text{MSPBE}_1(u) = \|\Pi_Y \bar{\delta}(u)\|_D^2, \qquad (8)$$

where $\Pi_Y \doteq Y (Y^\top D Y)^{-1} Y^\top D$ is the projection matrix and $\bar{\delta}(u) \doteq r - e_1^\top u \mathbf{1} + P_\pi Y u - Y u$ is the vector of TD errors for all state-action pairs. The vector $\Pi_Y \bar{\delta}(u)$ is the projection of the vector of TD errors onto the column space of $Y$. The existence of the matrix inverse in $\Pi_Y$, $(Y^\top D Y)^{-1}$, is guaranteed by Assumption 2.2 and

Assumption 4.1. For any $w \in \mathbb{R}^K$ and $c \in \mathbb{R}$, $Xw = c\mathbf{1}$ holds only if $w = \mathbf{0}$ and $c = 0$.

The above assumption guarantees that if $w^*$ is a solution for $w$ in (5), then no other solution's approximated action-value function would be identical to $Xw^*$ up to a constant. This assumption is also used by Tsitsiklis & Van Roy (1999) in their on-policy policy evaluation algorithms in average-reward MDPs. Apparently the assumption does not hold in the tabular setting (i.e., when $X = I$). However, with function approximation, we usually have many more states than features (i.e., $|\mathcal{S}| \gg K$), in which case the above assumption would not be restrictive.

Let $C \doteq Y^\top D Y$. We have $\Pi_Y^\top D \Pi_Y = D Y C^{-1} Y^\top D$, with which we give a different form for (8):

$$\text{MSPBE}_1(u) = \|Y^\top D \bar{\delta}(u)\|^2_{C^{-1}} = \|Au + b\|^2_{C^{-1}} = \mathbb{E}[\delta_k(u) y_k]^\top \mathbb{E}[y_k y_k^\top]^{-1} \mathbb{E}[\delta_k(u) y_k]. \qquad (9)$$

It can be seen that if (6) has a solution, then that solution also minimizes (9), in which case solving (6) can be converted to minimizing (9). However, when (6) does not have a unique solution, the set of minimizers of (9) could be unbounded and thus algorithms minimizing MSPBE$_1$ risk generating unbounded updates. To ensure the stability of our algorithm when (6) does not have a unique solution, we use a regularized MSPBE$_1$ as our objective:

$$J_{1,\eta}(u) \doteq \|Au + b\|^2_{C^{-1}} + \eta u^\top I_0 u,$$

where $I_0 \doteq \operatorname{diag}(\mathbf{1} - e_1)$, $\eta$ is a positive scalar, and $\eta u^\top I_0 u = \eta \|w\|^2$ is a ridge regularization term on $w$. To minimize $J_{1,\eta}(u)$, one could proceed with techniques used in TDC (Sutton et al., 2009a), which we leave for future work. In this paper, we proceed with the saddle-point formulation of GTD2 introduced by Liu et al. (2015), which exploits Fenchel's duality, $u^\top M^{-1} u = \max_\nu (2 u^\top \nu - \nu^\top M \nu)$ for any positive definite $M$, yielding

$$J_{1,\eta}(u) = \max_{\nu \in \mathbb{R}^{K+1}} 2\nu^\top Y^\top D \bar{\delta}(u) - \nu^\top C \nu + \eta u^\top I_0 u. \qquad (10)$$

So $\min_u J_{1,\eta}(u) = \min_u \max_\nu J_{1,\eta}(u, \nu)$, where $J_{1,\eta}(u, \nu) \doteq 2\nu^\top Y^\top D \bar{\delta}(u) - \nu^\top C \nu + \eta u^\top I_0 u$. As $J_{1,\eta}(u, \nu)$ is convex in $u$ and concave in $\nu$, we have now reduced the problem to a convex-concave saddle-point problem. Applying primal-dual methods to this problem, that is, performing gradient ascent for $\nu$ following $\nabla_\nu J_{1,\eta}(u, \nu)$ and gradient descent for $u$ following $\nabla_u J_{1,\eta}(u, \nu)$, we arrive at our first new algorithm, One-Stage Differential Gradient Q Evaluation, or Diff-GQ1. At the $k$-th iteration, with a sample $(S_k, A_k, R_k, S'_k, A'_k)$ from $d_{\mu\pi}$, Diff-GQ1 updates $u_k$ and $\nu_k$ as

$$\delta_k \doteq R_k - e_1^\top u_k + y_k'^\top u_k - y_k^\top u_k,$$
$$\nu_{k+1} \doteq \nu_k + \alpha_k (\delta_k - y_k^\top \nu_k) y_k, \qquad (11)$$
$$u_{k+1} \doteq u_k + \alpha_k (y_k - y'_k + e_1) y_k^\top \nu_k - \alpha_k \eta I_0 u_k,$$

where $\{\alpha_k\}$ is the sequence of learning rates satisfying the following standard assumption:

Assumption 4.2. $\{\alpha_k\}$ is a positive deterministic nonincreasing sequence s.t. $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$.

The algorithm is one-stage because, while there are two weight vectors updated in every iteration, both converge simultaneously.
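A minimal NumPy sketch of one Diff-GQ1 iteration (11), assuming the augmented features $y_k = [1, x_k^\top]^\top$ and $y'_k = [1, x_k'^\top]^\top$ have already been formed from the sampled transition; the function and variable names are ours.

```python
import numpy as np

def diff_gq1_step(u, nu, alpha, eta, y, reward, y_next):
    """One Diff-GQ1 update, eq. (11). u = [r_hat, w], nu is the dual weight vector."""
    e1 = np.zeros_like(u)
    e1[0] = 1.0
    # delta_k = R_k - e1^T u + y'^T u - y^T u  (note e1^T u = u[0] = r_hat).
    delta = reward - u[0] + y_next @ u - y @ u
    # Dual ascent: nu <- nu + alpha * (delta - y^T nu) * y.
    nu_new = nu + alpha * (delta - y @ nu) * y
    # Primal descent: u <- u + alpha * (y - y' + e1) * (y^T nu) - alpha * eta * I0 u,
    # where I0 zeroes the r_hat component (only w is ridge-regularized).
    I0_u = u.copy()
    I0_u[0] = 0.0
    u_new = u + alpha * (y - y_next + e1) * (y @ nu) - alpha * eta * I0_u
    return u_new, nu_new
```

Both weight vectors are updated with the same stepsize sequence, matching the one-stage nature of the algorithm.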
Theorem 1. If Assumptions 2.1, 2.2, 4.1, and 4.2 hold, then for any $\eta > 0$, almost surely, the iterates $\{u_k\}$ generated by Diff-GQ1 (11) converge to $u^*_\eta$, where $u^*_\eta \doteq -(\eta I_0 + A^\top C^{-1} A)^{-1} A^\top C^{-1} b$ is the unique minimizer of $J_{1,\eta}(u)$. Further, if $A$ is invertible, then for $\eta = 0$, $\{u_k\}$ converges almost surely to the $u^*_{\text{TD}}$ defined in (7).

We defer the full proof to Section A.1.

Proof (sketch). With $\kappa_k \doteq [\nu_k^\top, u_k^\top]^\top$, we rewrite (11) as $\kappa_{k+1} \doteq \kappa_k + \alpha_k (G_{k+1} \kappa_k + h_{k+1})$, where

$$G_{k+1} \doteq \begin{bmatrix} -y_k y_k^\top & y_k (y'_k - y_k - e_1)^\top \\ (y_k - y'_k + e_1) y_k^\top & -\eta I_0 \end{bmatrix}, \qquad h_{k+1} \doteq \begin{bmatrix} y_k R_k \\ \mathbf{0} \end{bmatrix}.$$

The asymptotic behavior of $\{\kappa_k\}$ is governed by

$$G \doteq \mathbb{E}[G_{k+1}] = \begin{bmatrix} -C & A \\ -A^\top & -\eta I_0 \end{bmatrix}, \qquad h \doteq \mathbb{E}[h_{k+1}] = \begin{bmatrix} b \\ \mathbf{0} \end{bmatrix}.$$

The convergence of $\kappa_k$ to a unique point can be guaranteed if $G$ is a Hurwitz matrix, or equivalently, if the real part of every eigenvalue of $G$ is strictly negative. Therefore, it is important to first ensure that $G$ is nonsingular. If $A$ were nonsingular, we could show easily that $G$ is nonsingular even with $\eta = 0$. However, in general, $A$ may not be nonsingular and we therefore require $\eta > 0$ to ensure that $G$ is nonsingular. We can then show that the real part of every eigenvalue of $G$ is strictly negative and thus $G$ is Hurwitz. Standard stochastic approximation results (Borkar, 2009) then show $\lim_k \kappa_k = -G^{-1} h$. Defining $u^*_\eta$ as the lower half of $-G^{-1} h$, we have $u^*_\eta = -(\eta I_0 + A^\top C^{-1} A)^{-1} A^\top C^{-1} b$. It is easy to verify (e.g., using the first-order optimality condition of $J_{1,\eta}(u)$) that $u^*_\eta$ is the unique minimizer of $J_{1,\eta}(u)$.
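When the model quantities are available, e.g., in a small diagnostic MDP, both the limit point of Theorem 1 and the Hurwitz condition from the proof sketch can be checked directly. Below is a sketch of ours (not part of the algorithm), assuming the augmented feature matrix $Y = [\mathbf{1}, X]$, $D = \operatorname{diag}(d_\mu)$, $P_\pi$, and $r$ are given as NumPy arrays.

```python
import numpy as np

def diff_gq1_limit_point(Y, D, P_pi, r, eta):
    """Closed-form check of Theorem 1: compute u*_eta and test whether G is Hurwitz."""
    n, K1 = Y.shape
    d_mu = np.diag(D)
    e1 = np.zeros(K1)
    e1[0] = 1.0
    # Model-based A, b, C as defined for eq. (6).
    A = Y.T @ D @ (P_pi - np.eye(n)) @ Y - np.outer(Y.T @ d_mu, e1)
    b = Y.T @ D @ r
    C = Y.T @ D @ Y
    I0 = np.diag(np.ones(K1) - e1)
    # Limit point: u*_eta = -(eta*I0 + A^T C^{-1} A)^{-1} A^T C^{-1} b.
    C_inv = np.linalg.inv(C)
    u_eta = -np.linalg.solve(eta * I0 + A.T @ C_inv @ A, A.T @ C_inv @ b)
    # G = [[-C, A], [-A^T, -eta*I0]] must be Hurwitz for the iterates to converge.
    G = np.block([[-C, A], [-A.T, -eta * I0]])
    is_hurwitz = bool(np.all(np.linalg.eigvals(G).real < 0))
    return u_eta, is_hurwitz
```

With $\eta = 0$ and an invertible $A$, the same routine recovers $u^*_{\text{TD}} = -A^{-1} b$ up to numerical error.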
Quality of TD Fixed Points. We now analyze the quality of TD fixed points. For our analysis, we make the following assumption.

Assumption 4.3. There exists at least one TD fixed point.

Let $u^* = [\hat{r}^*, w^{*\top}]^\top$ be one fixed point (a solution of (6)). We are interested in an upper bound on the absolute difference between the estimated reward rate and the true reward rate, $|\hat{r}^* - r_\pi|$, and also an upper bound on the minimum distance between the estimated differential value function and the set $\{q_\pi + c\mathbf{1}\}$. In general, as long as there is representation error, the TD fixed point can be arbitrarily poor in terms of approximating the value function, even in the discounted case (see Kolter (2011) for more discussion). In light of this, we study the bounds only when $d_\mu$ is close to $d_\pi$, the stationary state-action distribution of $\pi$, in the sense of the following assumption. Let $\xi \in (0, 1)$ be a constant.

Assumption 4.4. $F$ is positive semidefinite, where

$$F \doteq \begin{bmatrix} X^\top D X & X^\top D P_\pi X \\ X^\top P_\pi^\top D X & \xi^2 X^\top D X \end{bmatrix}.$$

A similar assumption about $F$ is also used by Kolter (2011) in the analysis of the performance of the MSPBE minimizer in the discounted setting. Kolter (2011) uses $\xi = 1$ while we use $\xi < 1$ to account for the lack of discounting. In Section D.1, we show with simulation that this assumption holds with reasonable probability in our randomly generated MDPs. Furthermore, we consider the bounds when all the features have zero mean under the distribution $d_\mu$.

Assumption 4.5. $X^\top d_\mu = \mathbf{0}$.

This can easily be achieved by subtracting from each feature vector sampled in our learning algorithm an estimated mean feature vector, which is the empirical average of all the feature vectors sampled from $d_\mu$. Note that without this mean-centered feature assumption, a looser bound can also be obtained. Our intention here is to show that the bounds for our algorithms are on par with their counterparts in the discounted setting and thus one does not lose these bounds when one moves from the discounted setting to the average-reward setting.

Proposition 1. Under Assumptions 2.1, 2.2, 4.1, and 4.3-4.5,

$$\inf_{c \in \mathbb{R}} \|Xw^* - q^c_\pi\|_D \le \frac{\|P_\pi\|_D + 1}{1 - \xi} \inf_{c \in \mathbb{R}} \|\Pi_X q^c_\pi - q^c_\pi\|_D,$$
$$|\hat{r}^* - r_\pi| \le \frac{\|d_\mu^\top (P_\pi - I)\|_{D^{-1}} (\|P_\pi\|_D + 1)}{1 - \xi} \inf_{c \in \mathbb{R}} \|\Pi_X q^c_\pi - q^c_\pi\|_D,$$

where $q^c_\pi \doteq q_\pi + c\mathbf{1}$ and $\Pi_X \doteq X (X^\top D X)^{-1} X^\top D$. We defer the proof to Section A.2.

As a special case, there exists a unique TD fixed point in the on-policy case (i.e., $d_\mu = d_\pi$) under Assumptions 2.1, 2.3, and 4.1. Then $|r_\pi - \hat{r}^*| = 0$, as $d_\pi^\top (P_\pi - I) = \mathbf{0}$, and a tighter bound for the estimated differential value function can be obtained. See Tsitsiklis & Van Roy (1999) for details.

Finite Sample Analysis. We now provide a finite sample analysis for a variant of Diff-GQ1, Projected Diff-GQ1, which is detailed in Section A.3 in the appendix. Projected Diff-GQ1 differs from Diff-GQ1 in three ways: 1) at each iteration, Projected Diff-GQ1 projects the two updated weight vectors onto two bounded closed sets to ensure that the weight vectors do not become too large, 2) Projected Diff-GQ1 uses a constant stepsize, and 3) Projected Diff-GQ1 does not impose ridge regularization, that is, it considers the objective MSPBE$_1$ directly.

Proposition 2. (Informal) Under standard assumptions, if Assumption 4.4 holds and $A$ is nonsingular, then with proper stepsizes, with high probability, the iterates $\{\hat{r}_k\}$, $\{w_k\}$ generated by Projected Diff-GQ1 satisfy

$$(\bar{\hat{r}}_k - r_\pi)^2 = O\Big(\inf_{c \in \mathbb{R}} \|X \bar{w}_k - q^c_\pi\|_D^2\Big) + O\Big(\inf_{c \in \mathbb{R}} \|\Pi_X q^c_\pi - q^c_\pi\|_D^2\Big),$$

where $\bar{\hat{r}}_k \doteq (1/k) \sum_{i=1}^{k} \hat{r}_i$ and $\bar{w}_k \doteq (1/k) \sum_{i=1}^{k} w_i$. We defer the precise statement and its proof to Section A.3.

5. Two-Stage Differential Gradient Q Evaluation

While Assumption 4.1 is not restrictive, we present in this section a new algorithm that does not require it but can still converge to the TD fixed point if it uniquely exists. The algorithm achieves this by drawing one more sample from $d_{\mu\pi}$ for each iteration and performing two learning stages, where $\hat{r}$ converges only after $w$ has converged. We call this algorithm Two-Stage Differential Gradient Q Evaluation, or Diff-GQ2, and derive it as follows. Consider the TD fixed point (5). Writing $\mathbb{E}[\delta_k(w, \hat{r})] = 0$ in vector form, we have

$$\hat{r} = d_\mu^\top (r + P_\pi X w - X w). \qquad (12)$$

Replacing $\hat{r}$ in $\mathbb{E}[\delta_k(w, \hat{r}) x_k] = \mathbf{0}$ with (12), we have

$$X^\top D (r + P_\pi X w - X w) - X^\top D \mathbf{1} d_\mu^\top (r + P_\pi X w - X w) = \mathbf{0},$$

or equivalently

$$A_2 w + b_2 = \mathbf{0}, \qquad (13)$$

where $A_2 \doteq X^\top (D - d_\mu d_\mu^\top)(P_\pi - I) X$ and $b_2 \doteq X^\top (D - d_\mu d_\mu^\top) r$. The combination of (12) and (13) is an alternative definition of TD fixed points. When $A_2$ is invertible, the unique TD fixed point is

$$w^*_{\text{TD}} \doteq -A_2^{-1} b_2, \qquad \hat{r}^*_{\text{TD}} \doteq d_\mu^\top (r + P_\pi X w^*_{\text{TD}} - X w^*_{\text{TD}}). \qquad (14)$$

It is easy to verify that $u^*_{\text{TD}} = [\hat{r}^*_{\text{TD}}, w^{*\top}_{\text{TD}}]^\top$, where $u^*_{\text{TD}}$ is defined in (7). Denote $\bar{r}_w \doteq r + P_\pi X w - X w$; then (13) can be written as $X^\top D (\bar{r}_w - d_\mu^\top \bar{r}_w \mathbf{1}) = \mathbf{0}$, from which we define a new MSPBE objective:

$$\text{MSPBE}_2(w) \doteq \|\Pi_X (\bar{r}_w - d_\mu^\top \bar{r}_w \mathbf{1})\|_D^2,$$

where $C_2 \doteq X^\top D X$ in $\Pi_X$ is invertible under Assumptions 2.2 and 2.3. MSPBE$_2$ differs from MSPBE$_1$ defined in (9) in that MSPBE$_2$ is a function of $w$ only, while MSPBE$_1$ is a function of both $w$ and $\hat{r}$. However, the solutions of $\text{MSPBE}_2(w) = 0$ are exactly the solutions for $w$ in $\text{MSPBE}_1(u) = 0$, if both solutions exist. After introducing a ridge term with $\eta > 0$ for the same reason as in Diff-GQ1, we arrive at the objective that Diff-GQ2 minimizes:

$$J_{2,\eta}(w) \doteq \|X^\top D (\bar{r}_w - d_\mu^\top \bar{r}_w \mathbf{1})\|^2_{C_2^{-1}} + \eta \|w\|^2.$$

Applying Fenchel's duality to $J_{2,\eta}(w)$ yields $\min_w J_{2,\eta}(w) = \min_w \max_\nu J_{2,\eta}(w, \nu)$, where

$$J_{2,\eta}(w, \nu) \doteq 2\nu^\top X^\top D (\bar{r}_w - d_\mu^\top \bar{r}_w \mathbf{1}) - \nu^\top C_2 \nu + \eta \|w\|^2.$$

$J_{2,\eta}(w, \nu)$ is convex in $w$ and concave in $\nu$. To apply primal-dual methods for finding the saddle point of $J_{2,\eta}(w, \nu)$, we need to obtain unbiased samples of $X^\top D (\bar{r}_w - d_\mu^\top \bar{r}_w \mathbf{1})$. As this term includes two nested expectations (i.e., $D$ and $d_\mu$), Diff-GQ2 requires two i.i.d. samples $(S_{k,1}, A_{k,1}, R_{k,1}, S'_{k,1}, A'_{k,1})$ and $(S_{k,2}, A_{k,2}, R_{k,2}, S'_{k,2}, A'_{k,2})$ from $d_{\mu\pi}$ at the $k$-th iteration for a single gradient update. This is not the notorious double sampling issue in minimizing the Mean Squared Bellman Error (see, e.g., Baird (1995) and Section 11.5 of Sutton & Barto (2018)), where two successor states $s'_1$ and $s'_2$ from a single state-action pair $(s, a)$ are required, which is not possible in the function approximation setting. Sampling two i.i.d. tuples from $d_{\mu\pi}$ is completely feasible.
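As with Diff-GQ1, the fixed point (14) that Diff-GQ2 targets can be computed from model quantities for diagnostic purposes. Below is a sketch of ours (not part of the algorithm), assuming $X$, $D = \operatorname{diag}(d_\mu)$, $P_\pi$, and $r$ are available as NumPy arrays.

```python
import numpy as np

def diff_gq2_fixed_point(X, D, P_pi, r):
    """Model-based evaluation of the TD fixed point (13)-(14) targeted by Diff-GQ2."""
    n = X.shape[0]
    d_mu = np.diag(D)
    W = D - np.outer(d_mu, d_mu)                   # D - d_mu d_mu^T
    A2 = X.T @ W @ (P_pi - np.eye(n)) @ X          # eq. (13)
    b2 = X.T @ W @ r
    w_td = -np.linalg.solve(A2, b2)                # eq. (14), assuming A2 is invertible
    r_bar_w = r + P_pi @ (X @ w_td) - X @ w_td
    r_hat_td = d_mu @ r_bar_w                      # eq. (12)
    return w_td, r_hat_td
```
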
At the $k$-th iteration, Diff-GQ2 updates $\nu$ and $w$ as

$$\nu_{k+1} \doteq \nu_k + \alpha_k \big( R_{k,1} + x_{k,1}'^\top w_k - x_{k,1}^\top w_k - (R_{k,2} + x_{k,2}'^\top w_k - x_{k,2}^\top w_k) - x_{k,1}^\top \nu_k \big) x_{k,1}, \qquad (15)$$
$$w_{k+1} \doteq w_k + \alpha_k \big( x_{k,1} - x'_{k,1} + x'_{k,2} - x_{k,2} \big) x_{k,1}^\top \nu_k - \alpha_k \eta w_k,$$

where $x_{k,i} \doteq x(S_{k,i}, A_{k,i})$, $x'_{k,i} \doteq x(S'_{k,i}, A'_{k,i})$, and $\{\alpha_k\}$, again, satisfies Assumption 4.2. Additionally, following (12), Diff-GQ2 updates $\hat{r}$ as

$$\hat{r}_{k+1} \doteq \hat{r}_k + \beta_k \Big( \tfrac{1}{2} \sum_{i=1}^{2} (R_{k,i} + x_{k,i}'^\top w_k - x_{k,i}^\top w_k) - \hat{r}_k \Big), \qquad (16)$$

where $\{\beta_k\}$ satisfies the same assumption as $\{\alpha_k\}$:

Assumption 5.1. $\{\beta_k\}$ is a positive deterministic nonincreasing sequence s.t. $\sum_k \beta_k = \infty$ and $\sum_k \beta_k^2 < \infty$.

Theorem 2. If Assumptions 2.1, 2.2, 2.3, 4.2, and 5.1 hold, then almost surely, the iterates $\{w_k\}$, $\{\hat{r}_k\}$ generated by Diff-GQ2 (15) and (16) satisfy

$$\lim_k w_k = w^*_\eta, \qquad \lim_k \hat{r}_k = d_\mu^\top (r + P_\pi X w^*_\eta - X w^*_\eta),$$

where $w^*_\eta \doteq -(\eta I + A_2^\top C_2^{-1} A_2)^{-1} A_2^\top C_2^{-1} b_2$ is the unique minimizer of $J_{2,\eta}(w)$. Define $w^*_0 \doteq \lim_{\eta \to 0} w^*_\eta$; we have $\|w^*_\eta - w^*_0\| \le \eta U_0$ for some constant $U_0$. Further, if Assumption 4.3 holds, then $A_2 w^*_0 + b_2 = \mathbf{0}$, and if $A_2$ is invertible, then for $\eta = 0$, $w_k$ and $\hat{r}_k$ converge almost surely to the $w^*_{\text{TD}}$ and $\hat{r}^*_{\text{TD}}$ defined in (14).

We defer the full proof to Section A.4. Similar to Projected Diff-GQ1, we provide a finite sample analysis for a variant of Diff-GQ2, Projected Diff-GQ2, in Section A.5.
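A minimal NumPy sketch of one Diff-GQ2 iteration, following (15) and (16) as written above; each of the two i.i.d. samples is passed as a tuple $(x_{k,i}, R_{k,i}, x'_{k,i})$, and the function and variable names are ours.

```python
import numpy as np

def diff_gq2_step(w, nu, r_hat, alpha, beta, eta, sample1, sample2):
    """One Diff-GQ2 update, eqs. (15)-(16). Each sample is (x, R, x') for one transition."""
    (x1, R1, xp1), (x2, R2, xp2) = sample1, sample2
    td1 = R1 + xp1 @ w - x1 @ w     # sample of r_bar_w at the first transition
    td2 = R2 + xp2 @ w - x2 @ w     # independent sample, estimating d_mu^T r_bar_w
    # Dual ascent (15): nu <- nu + alpha * (td1 - td2 - x1^T nu) * x1.
    nu_new = nu + alpha * (td1 - td2 - x1 @ nu) * x1
    # Primal descent: w <- w + alpha * (x1 - x'1 + x'2 - x2) * (x1^T nu) - alpha * eta * w.
    w_new = w + alpha * (x1 - xp1 + xp2 - x2) * (x1 @ nu) - alpha * eta * w
    # Reward-rate tracking (16): r_hat follows the average of the two TD targets.
    r_hat_new = r_hat + beta * (0.5 * (td1 + td2) - r_hat)
    return w_new, nu_new, r_hat_new
```

In a training loop, the two tuples are drawn independently from the sampling distribution at every iteration, so one Diff-GQ2 update consumes two samples.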
6. Experiments

In light of the reproducibility challenge in RL research (Henderson et al., 2017), we perform a grid search with 30 independent runs for hyperparameter tuning in all our experiments. Each curve corresponds to the best hyperparameters minimizing the error of the reward rate prediction at the end of training and is averaged over 30 independent runs, with the shaded region indicating one standard deviation. To the best of our knowledge, GradientDICE is the only density-ratio-based approach for off-policy policy evaluation in average-reward MDPs that is provably convergent with general linear function approximation and has $O(K)$ computational complexity per step. We, therefore, use GradientDICE as a baseline. See Section B.1 for more details about GradientDICE. All the implementations are publicly available at https://github.com/ShangtongZhang/DeepRL.

Figure 2. Boyan's chain with linear function approximation. We vary $\pi_0$ in $\{0.1, 0.3, 0.5, 0.7, 0.9\}$. In the first row, we use $\mu_0 = \pi_0$; in the second row, we use $\mu_0 = 0.5$; in the third row, we use $\mu_0 = 1 - \pi_0$. $\bar{\hat{r}}$ is the average $\hat{r}$ of the most recent 100 steps. (Curves compare GradientDICE, Diff-SGQ, Diff-GQ1, and Diff-GQ2.)

Figure 3. A variant of Boyan's chain for policy evaluation in the average-reward setting. There are 13 states $\{s_0, \dots, s_{12}\}$ and two actions $\{a_0, a_1\}$ in the chain. For any $i \in \{0, \dots, 12\}$, $r(s_i, a_0) = 1$ and $r(s_i, a_1) = 2$. For any $i \ge 2$, $p(s_{i-2}|s_i, a_0) = 1$ and $p(s_{i-1}|s_i, a_1) = 1$. At $s_1$, both actions lead to $s_0$. At $s_0$, both actions lead to a random state in $\{s_0, \dots, s_{12}\}$ with equal probability.

Linear Function Approximation. We benchmark Diff-SGQ, Diff-GQ1, Diff-GQ2, and GradientDICE in a variant of Boyan's chain (Boyan, 1999), which is the same as the chain used in Zhang et al. (2020b) except that we introduce a nonzero reward for each action for the purpose of policy evaluation. The chain is illustrated in Figure 3. We consider target policies of the form $\pi(a_0|s_i) = \pi_0$ for all $s_i$, where $\pi_0 \in [0, 1]$ is some constant. The sampling distribution we consider has the form $d_\mu(s_i, a_0) = \frac{\mu_0}{13}$ and $d_\mu(s_i, a_1) = \frac{1 - \mu_0}{13}$ for all $s_i$, where $\mu_0 \in [0, 1]$ is some constant. Even if $\mu_0 = \pi_0$, the problem is still off-policy. We consider linear function approximation and use the same state features as Boyan (1999), which are detailed in Section C. We use a one-hot encoding for actions. Concatenating the state feature and the one-hot action feature yields the state-action feature we use in the experiments.

We use constant learning rates $\alpha$ for all compared algorithms, tuned over $\{2^{-20}, 2^{-19}, \dots, 2^{-1}\}$. For Diff-GQ1 and Diff-GQ2, besides tuning $\alpha$ in the same way as for Diff-SGQ, we tune $\eta$ in $\{0, 0.01, 0.1\}$. For GradientDICE, besides tuning $(\alpha, \eta)$ in the same way as for Diff-GQ1, we tune $\lambda$, the weight for a normalizing term, in $\{0, 0.1, 1, 10\}$. We run each algorithm for $5 \times 10^3$ steps. Diff-GQ2 updates are applied every two steps, as one Diff-GQ2 update requires two samples. The results in Figure 2 suggest that the three differential-value-based algorithms proposed in this paper consistently outperform the density-ratio-based algorithm GradientDICE in the tested domain.

Nonlinear Function Approximation. The value-based off-policy policy evaluation algorithms proposed in this paper can easily be combined with neural network function approximators. For Diff-SGQ, we use a target network (Mnih et al., 2015) to stabilize the training of neural networks. For Diff-GQ1 and Diff-GQ2, we introduce neural network function approximators in the saddle-point objectives (i.e., $J_{1,\eta}(u, \nu)$ and $J_{2,\eta}(w, \nu)$) directly, similar to Zhang et al. (2020b) in GradientDICE. The details are provided in Sections B.2, B.3, and B.4.

Figure 4. MuJoCo tasks (HalfCheetah, Walker2d, Hopper, Swimmer; $\sigma = 0.9$) with neural network function approximation. $\bar{\hat{r}}$ is the average $\hat{r}$ of the most recent 100 steps.

We benchmark Diff-SGQ, Diff-GQ1, Diff-GQ2, and GradientDICE in several MuJoCo domains. To this end, we first train a deterministic target policy $\pi_0$ with TD3 (Fujimoto et al., 2018). The behavior policy $\mu_0$ is composed by adding Gaussian noise to $\pi_0$, i.e., $\mu_0(\cdot|s) \doteq \mathcal{N}(\pi_0(s), \sigma^2 I)$. The ground-truth reward rate of $\pi_0$ is computed with Monte Carlo methods by running $\pi_0$ for $10^6$ steps. We vary $\sigma$ over $\{0.1, 0.5, 0.9\}$. More details are provided in Section C. For Diff-SGQ, Diff-GQ1, and Diff-GQ2, we tune the learning rate over $\{0.1, 0.05, 0.01, 0.005, 0.001\}$. For GradientDICE, we additionally tune $\lambda$ over $\{0.1, 1, 10\}$. The results with $\sigma = 0.9$ are reported in Figure 4, where Diff-GQ1 consistently performs the best. The results with $\sigma = 0.1$ and $\sigma = 0.5$ are deferred to Section D.2, where Diff-GQ1 consistently performs the best as well.
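For completeness, here is a sketch of how the chain of Figure 3 can be instantiated as a state-action transition matrix $P_\pi$ and reward vector $r$ for a target policy $\pi(a_0|s_i) = \pi_0$; this is our own reconstruction from the figure caption and the text above, not code from the public implementation.

```python
import numpy as np

n_states, n_actions = 13, 2          # states s_0..s_12, actions a_0 (jump 2) and a_1 (jump 1)

def next_state_probs(i, a):
    """Transition distribution of the Figure 3 chain."""
    p = np.zeros(n_states)
    if i == 0:
        p[:] = 1.0 / n_states        # s_0: both actions go to a uniformly random state
    elif i == 1:
        p[0] = 1.0                   # s_1: both actions lead to s_0
    else:
        p[i - 2 if a == 0 else i - 1] = 1.0
    return p

def build_model(pi0):
    """State-action transition matrix P_pi and reward vector r for pi(a_0|s) = pi0."""
    nsa = n_states * n_actions
    P_pi, r = np.zeros((nsa, nsa)), np.zeros(nsa)
    pi = np.array([pi0, 1.0 - pi0])
    for i in range(n_states):
        for a in range(n_actions):
            sa = i * n_actions + a
            r[sa] = 1.0 if a == 0 else 2.0          # r(s, a_0) = 1, r(s, a_1) = 2
            p_next = next_state_probs(i, a)
            for j in range(n_states):
                for ap in range(n_actions):
                    P_pi[sa, j * n_actions + ap] = p_next[j] * pi[ap]
    return P_pi, r
```

The sampling distribution described above corresponds to $d_\mu(s_i, a_0) = \mu_0 / 13$ and $d_\mu(s_i, a_1) = (1 - \mu_0) / 13$ and can be encoded analogously.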
7. Related Work and Discussion

In this paper, we addressed the policy evaluation problem with function approximation in the model-free setting. If the model is given or learned by the agent, such a problem could be solved by, for example, classic approximate dynamic programming approaches (Powell, 2007), search algorithms (Russell & Norvig, 2002), and other optimal control algorithms (Kirk, 2004). For more discussion about learning a model, see, for example, Sutton (1990); Sutton et al. (2012); Liu et al. (2018b); Chua et al. (2018); Wan et al. (2019); Gottesman et al. (2019); Yu et al. (2020); Kidambi et al. (2020).

The on-policy average-reward policy evaluation problem was studied by Tsitsiklis & Van Roy (1999), who proposed and solved a Projected Bellman Equation (PBE). The reward rate in the PBE is a known quantity, which is trivial to estimate in the on-policy case. The reward rate, however, cannot be obtained easily in the off-policy case and needs to be estimated cleverly. Such a challenge is resolved in our work by optimizing a novel objective, MSPBE$_1$, which has the reward rate estimate as a free variable to optimize. Moreover, by proposing the other novel objective, MSPBE$_2$, we showed that the reward rate or its direct estimate does not even have to appear in an objective. In fact, for the uniqueness of the solution, our algorithms did not optimize MSPBE$_1$ and MSPBE$_2$ directly, but optimized regularized versions of these objectives, where the weight of the regularization term can be arbitrarily small. Introducing a regularization term in MSPBE-like objectives is not new, though; see, for example, Mahadevan et al. (2014); Yu (2017); Du et al. (2017); Zhang et al. (2020d;b). One could, of course, apply regularization to Diff-SGQ directly, similar to Diddigi et al. (2019) in the discounted off-policy linear TD setting. Unfortunately, the weight for their regularization term has to be sufficiently large to ensure convergence. Fenchel's duality, which we used in the derivation of our algorithms, is not new in RL research. For example, it has been applied to cope with the double sampling problem (see, e.g., Liu et al. (2015); Macua et al. (2014); Dai et al. (2017); Xie et al. (2018); Nachum et al. (2019a;b); Zhang et al. (2020a;b)) or to construct novel policy iteration frameworks (Zhang et al., 2020c).

8. Conclusion

In this paper, we provided the first study of the off-policy policy evaluation problem (estimating both the reward rate and the differential value function) in the function approximation, average-reward setting. Such a problem encapsulates the existing off-policy evaluation problem (estimating only the reward rate; see, e.g., Li (2019)). To this end, we proposed two novel MSPBE objectives and derived two algorithms optimizing regularized versions of these objectives. The proposed algorithms are the first provably convergent algorithms for estimating the differential value function and are also the first provably convergent algorithms for estimating the reward rate without estimating the density ratio in this setting. In terms of estimating the reward rate, though our goal is not to achieve a new state of the art, our empirical results confirmed that the proposed value-based methods consistently outperform a competitive density-ratio-based method in the tested domains. We conjecture that this performance advantage results from the flexibility of value-based methods, that is, for any $c$, $q_\pi + c\mathbf{1}$ is a feasible learning target.
By contrast, the density ratio $d_\pi(s, a) / d_\mu(s, a)$ is unique. Overall, our empirical study suggests that value-based methods deserve more attention in future research on off-policy evaluation in average-reward MDPs.

Acknowledgments

SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA.

References

Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvári, C., and Weisz, G. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, 2019.

Baird, L. Residual algorithms: Reinforcement learning with function approximation. Machine Learning, 1995.

Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint. Springer, 2009.

Boyan, J. A. Least-squares temporal difference learning. In Proceedings of the 16th International Conference on Machine Learning, 1999.

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.

Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. SBEED: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.

Diddigi, R. B., Kamanchi, C., and Bhatnagar, S. A convergent off-policy temporal difference algorithm. arXiv preprint arXiv:1911.05697, 2019.

Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Dulac-Arnold, G., Mankowitz, D., and Hester, T. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

Gelada, C. and Bellemare, M. G. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.

Gottesman, O., Liu, Y., Sussex, S., Brunskill, E., and Doshi-Velez, F. Combining parametric and nonparametric models for off-policy evaluation. arXiv preprint arXiv:1905.05787, 2019.

Hallak, A. and Mannor, S. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.

Howard, R. A. Dynamic Programming and Markov Processes. 1960.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.

Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.

Kirk, D. E. Optimal Control Theory: An Introduction. Courier Corporation, 2004.

Kolter, J. Z. The fixed points of off-policy TD. In Advances in Neural Information Processing Systems, 2011.

Konda, V. R. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.

Lazic, N., Yin, D., Farajtabar, M., Levine, N., Gorur, D., Harris, C., and Schuurmans, D. A maximum-entropy approach to off-policy evaluation in average-reward MDPs. In Advances in Neural Information Processing Systems, 2020.
Li, L. A perspective on off-policy evaluation in reinforcement learning. Frontiers of Computer Science, 2019.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, 2018a.

Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., and Brunskill, E. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, 2018b.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.

Macua, S. V., Chen, J., Zazo, S., and Sayed, A. H. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 2014.

Mahadevan, S., Liu, B., Thomas, P., Dabney, W., Giguere, S., Jacek, N., Gemp, I., and Liu, J. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv preprint arXiv:1405.6757, 2014.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 2015.

Mousavi, A., Li, L., Liu, Q., and Zhou, D. Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations, 2020.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733, 2019a.

Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Powell, W. B. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. John Wiley & Sons, 2007.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. 2002.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning, 1990.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction (2nd Edition). MIT Press, 2018.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, 2009a.

Sutton, R. S., Maei, H. R., and Szepesvári, C. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009b.
Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. P. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.

Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 2016.

Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q. Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186, 2019.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

Tsitsiklis, J. N. and Van Roy, B. Average cost temporal-difference learning. Automatica, 35(11):1799-1808, 1999.

Uehara, M. and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

Wan, Y., Zaheer, M., White, A., White, M., and Sutton, R. S. Planning with expectation models. arXiv preprint arXiv:1904.01191, 2019.

Wan, Y., Naik, A., and Sutton, R. S. Learning and planning in average-reward Markov decision processes. arXiv preprint arXiv:2006.16318, 2020.

Xie, T., Liu, B., Xu, Y., Ghavamzadeh, M., Chow, Y., Lyu, D., and Yoon, D. A block coordinate ascent algorithm for mean-variance optimization. In Advances in Neural Information Processing Systems, 2018.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, 2019.

Yu, H. On convergence of some gradient-based temporal-differences algorithms for off-policy learning. arXiv preprint arXiv:1712.09652, 2017.

Yu, H. and Bertsekas, D. P. Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control, 2009.

Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J., Levine, S., Finn, C., and Ma, T. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.

Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020a.

Zhang, S., Liu, B., and Whiteson, S. GradientDICE: Rethinking generalized offline estimation of stationary values. In Proceedings of the 37th International Conference on Machine Learning, 2020b.

Zhang, S., Liu, B., and Whiteson, S. Mean-variance policy iteration for risk-averse reinforcement learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2020c.

Zhang, S., Liu, B., Yao, H., and Whiteson, S. Provably convergent two-timescale off-policy actor-critic with function approximation. In Proceedings of the 37th International Conference on Machine Learning, 2020d.