# Off-Policy Evaluation via the Regularized Lagrangian

Mengjiao Yang 1, Ofir Nachum 1, Bo Dai 1, Lihong Li 1, Dale Schuurmans 1,2 (1 Google Research, 2 University of Alberta)

The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data. While these estimators all perform some form of stationary distribution correction, they arise from different derivations and objective functions. In this paper, we unify these estimators as regularized Lagrangians of the same linear program. The unification allows us to expand the space of DICE estimators to new alternatives that demonstrate improved performance. More importantly, by analyzing the expanded space of estimators both mathematically and empirically, we find that dual solutions offer greater flexibility in navigating the tradeoff between optimization stability and estimation bias, and generally provide superior estimates in practice.

1 Introduction

One of the most fundamental problems in reinforcement learning (RL) is policy evaluation, where we seek to estimate the expected long-term payoff of a given target policy in a decision-making environment. An important variant of this problem, off-policy evaluation (OPE) [23], is motivated by applications where deploying a policy in a live environment entails significant cost or risk [20, 27]. To circumvent these issues, OPE attempts to estimate the value of a target policy by referring only to a dataset of experience previously gathered by other policies in the environment. Often, such logging or behavior policies are not known explicitly (e.g., the experience may come from human actors), which necessitates the use of behavior-agnostic OPE methods [21].

While behavior-agnostic OPE appears to be a daunting problem, a number of estimators have recently been developed for this scenario [21, 28, 30, 31], demonstrating impressive empirical results. Such estimators, known collectively as the DICE family (DIstribution Correction Estimation), model the ratio between the propensity of the target policy to visit particular state-action pairs and the likelihood of those pairs appearing in the logged data. A distribution corrector of this form can then be used directly to estimate the value of the target policy.

Although there are many commonalities between the various DICE estimators, their derivations are distinct and seemingly incompatible. For example, DualDICE [21] is derived via a particular change-of-variables technique, whereas GenDICE [30] observes that this substitution strategy cannot work in the average-reward setting and proposes a distinct derivation based on distribution matching. GradientDICE [31] notes that GenDICE exacerbates optimization difficulties, and proposes a variant designed for limited sampling capabilities. Despite the apparent differences between these methods, the algorithms all involve a minimax optimization of strikingly similar form, which suggests that a common connection underlies the alternative derivations.

We show that the previous DICE formulations are all in fact equivalent to regularized Lagrangians of the same linear program (LP). This LP shares an intimate relationship with the policy evaluation problem, and has a primal form we refer to as the Q-LP and a dual form we refer to as the d-LP.
The primal form has been concurrently identified and studied in the context of policy optimization [22], but we focus here on the d-LP formulation for off-policy evaluation, which we find to have a more succinct and revealing form for this purpose. Using the d-LP, we identify a number of key choices in translating it into a stable minimax optimization problem, i.e., whether to include redundant constraints and whether to regularize the primal or dual variables, in addition to choices in how to translate an optimized solution into an asymptotically unbiased estimate of the policy value (with a slight abuse of terminology, we use asymptotically unbiased and unbiased interchangeably). We use this characterization to show that the known members of the DICE family are a small subset of specific choices made within a much larger, unexplored set of potential OPE methods.

To understand the consequences of the various choices, we provide a comprehensive study. First, we theoretically investigate which configurations lead to bias in the primal or dual solutions, and when this affects the final estimates. Our analysis shows that the dual solutions offer greater flexibility in stabilizing the optimization while preserving asymptotic unbiasedness, compared to the primal solutions. We also perform an extensive empirical evaluation of the various choices across different domains and function approximators, and identify novel configurations that improve the observed outcomes.

2 Background

We consider an infinite-horizon Markov Decision Process (MDP) [24], specified by a tuple M = ⟨S, A, R, T, µ0, γ⟩, which consists of a state space, action space, reward function, transition probability function, initial state distribution, and discount factor γ ∈ [0, 1]. For simplicity, we focus on the discounted case where γ ∈ [0, 1) unless otherwise specified; the same conclusions generally hold for the undiscounted case with γ = 1 (see Appendix E for details). A policy π interacts with the environment starting at an initial state s0 ∼ µ0, producing a distribution π(·|st) over A from which an action at is sampled and applied to the environment at each step t ≥ 0. The environment produces a scalar reward rt = R(st, at) (we consider a deterministic reward function; all of our results apply to stochastic rewards as well) and transitions to a new state st+1 ∼ T(st, at).

2.1 Policy Evaluation

The value of a policy π is defined as the normalized expected per-step reward it obtains:

$$\rho(\pi) := (1-\gamma)\,\mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, s_0 \sim \mu_0,\ a_t \sim \pi(s_t),\ s_{t+1} \sim T(s_t, a_t)\Big]. \tag{1}$$

In the policy evaluation setting, the policy being evaluated is referred to as the target policy. The value of a policy may be expressed in two equivalent ways:

$$\rho(\pi) = (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}[Q^\pi(s_0, a_0)] = \mathbb{E}_{(s,a) \sim d^\pi}[R(s,a)], \tag{2}$$

where Q^π and d^π are the state-action values and visitations of π, respectively, which satisfy

$$Q^\pi(s,a) = R(s,a) + \gamma\,\mathcal{P}^\pi Q^\pi(s,a), \quad \text{where}\ \ \mathcal{P}^\pi Q(s,a) := \mathbb{E}_{s' \sim T(s,a),\, a' \sim \pi(s')}[Q(s',a')], \tag{3}$$

$$d^\pi(s,a) = (1-\gamma)\mu_0(s)\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d^\pi(s,a), \quad \text{where}\ \ \mathcal{P}^\pi_* d(s,a) := \pi(a|s)\textstyle\sum_{\tilde{s},\tilde{a}} T(s \mid \tilde{s},\tilde{a})\, d(\tilde{s},\tilde{a}). \tag{4}$$

Note that P^π and P^π_* are linear operators that are transposes (adjoints) of each other. We refer to P^π as the policy transition operator and P^π_* as the transpose policy transition operator. The function Q^π corresponds to the Q-values of the policy π; it maps state-action pairs (s, a) to the expected value of policy π when run in the environment starting at (s, a). The function d^π corresponds to the on-policy distribution of π; it is the normalized distribution over state-action pairs (s, a) measuring the likelihood that π encounters the pair (s, a), averaged over time via γ-discounting. We make the following standard assumption, which is common in previous policy evaluation work [30, 22].

Assumption 1 (MDP ergodicity). There is a unique fixed-point solution to (4). When γ ∈ [0, 1), (4) always has a unique solution, since 0 cannot belong to the spectrum of I − γP^π_*. For γ = 1, the assumption reduces to ergodicity in the discrete case, under a restriction of d to normalized distributions; the continuous case is treated by [18].
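The quantity in (1) is what every estimator in this paper targets. As a point of reference, here is a minimal Monte Carlo sketch of (1) using on-policy rollouts; the environment interface (env_reset, env_step) and the policy sampler are hypothetical placeholders, and such rollouts are exactly what the off-policy setting below does not allow.

```python
import numpy as np

def rollout_value(env_reset, env_step, policy, gamma, num_episodes=100, horizon=1000):
    """Monte Carlo estimate of rho(pi) in Eq. (1): the (1 - gamma)-normalized
    expected discounted reward. The infinite sum is truncated at `horizon`.

    env_reset() -> s0 and env_step(s, a) -> (s_next, r) are hypothetical
    stand-ins for a simulator; policy(s) samples a ~ pi(.|s).
    """
    returns = []
    for _ in range(num_episodes):
        s = env_reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)
            total += discount * r
            discount *= gamma
        returns.append((1.0 - gamma) * total)
    return float(np.mean(returns))
```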
2.2 Off-Policy Evaluation via the DICE Family

Off-policy evaluation (OPE) aims to estimate ρ(π) using only a fixed dataset of experiences. Specifically, we assume access to a finite dataset D = {(s0^(i), s^(i), a^(i), r^(i), s′^(i))}_{i=1..N}, where s0^(i) ∼ µ0, (s^(i), a^(i)) ∼ d^D are samples from some unknown distribution d^D, r^(i) = R(s^(i), a^(i)), and s′^(i) ∼ T(s^(i), a^(i)). We at times abuse notation and use (s, a, r, s′) ∼ d^D or (s, a, r) ∼ d^D as shorthand for (s, a) ∼ d^D, r = R(s, a), s′ ∼ T(s, a), which simulates sampling from the dataset D when using a finite number of samples.

The recent DICE methods take advantage of the following expression for the policy value:

$$\rho(\pi) = \mathbb{E}_{(s,a,r)\sim d^{\mathcal{D}}}\big[\zeta_*(s,a)\, r\big], \tag{5}$$

where ζ*(s, a) := d^π(s, a)/d^D(s, a) is the distribution correction ratio. The existing DICE estimators seek to approximate this ratio without knowledge of d^π or d^D, and then apply (5) to derive an estimate of ρ(π). This general paradigm is supported by the following assumption.

Assumption 2 (Boundedness). The stationary correction ratio is bounded: ‖ζ*‖∞ ≤ C < ∞.

When γ < 1, DualDICE [21] chooses a convex objective whose optimal solution corresponds to this ratio, and employs a change of variables to transform the dependence on d^π to µ0. GenDICE [30], on the other hand, minimizes a divergence between successive on-policy state-action distributions, and introduces a normalization constraint to ensure the estimated ratios average to 1 over the off-policy dataset. Both DualDICE and GenDICE apply Fenchel duality to reduce an intractable convex objective to a minimax objective, which enables sampling and optimization in a stochastic or continuous action space. GradientDICE [31] extends GenDICE by using a linear parametrization so that the minimax optimization is convex-concave with convergence guarantees.

3 A Unified Framework of DICE Estimators

In this section, given a fixed target policy π, we present a linear programming (LP) representation of its state-action stationary distribution d^π(s, a) ∈ P, referred to as the d-LP; here P represents the space of all stationary distributions. The dual of this LP has solution Q^π, thus revealing the duality between the Q-function and the d-distribution of any policy π. We then discuss the mechanisms by which one can improve optimization stability through the application of regularization and redundant constraints. Although in general this may introduce bias into the final value estimate, there are a number of valid configurations for which the resulting estimator of ρ(π) remains unbiased. We show that existing DICE algorithms cover several choices of these configurations, while a sizable subset remains unexplored.
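Whatever configuration is chosen below, the final DICE estimate is obtained by plugging learned ratios into (5). The following is a minimal sketch of that step (illustrative only, not the paper's implementation; the ratio model in the usage comment is a hypothetical placeholder).

```python
import numpy as np

def dice_value_estimate(zeta_hat, rewards):
    """Empirical version of Eq. (5): the average of zeta_hat(s, a) * r over
    logged transitions (s, a, r) ~ d^D. This converges to rho(pi) under
    Assumptions 1 and 2 when zeta_hat equals the true correction ratio."""
    zeta_hat = np.asarray(zeta_hat, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    return float(np.mean(zeta_hat * rewards))

# Hypothetical usage, with `ratio_model` standing in for any learned DICE ratio:
# ratios = ratio_model(states, actions)
# rho_hat = dice_value_estimate(ratios, rewards)
```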
3.1 Linear Programming Representation for the d^π-distribution

The following theorem presents a formulation of ρ(π) in terms of a linear program with respect to the constraints in (4) and (3).

Theorem 1. Given a policy π, under Assumption 1, its value ρ(π) defined in (1) can be expressed by the following d-LP:

$$\max_{d:\,S\times A\to\mathbb{R}}\ \mathbb{E}_{d}\,[R(s,a)], \quad \text{s.t.}\quad d(s,a) = \underbrace{(1-\gamma)\mu_0(s)\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d(s,a)}_{\mathcal{B}^\pi_* d\,(s,a)}. \tag{6}$$

We refer to the d-LP above as the dual problem. Its corresponding primal LP is

$$\min_{Q:\,S\times A\to\mathbb{R}}\ (1-\gamma)\,\mathbb{E}_{\mu_0\pi}\,[Q(s,a)], \quad \text{s.t.}\quad Q(s,a) = \underbrace{R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a)}_{\mathcal{B}^\pi Q\,(s,a)}. \tag{7}$$

Proof. Notice that under Assumption 1, the constraint in (6) determines a unique solution, which is the stationary distribution d^π. Therefore, the objective value is ρ(π) by definition. On the other hand, due to the contraction property of γP^π, the primal problem is feasible and its solution is Q^π, which shows that the optimal objective value is also ρ(π), implying that strong duality holds.

Theorem 1 presents a succinct LP representation of the policy value and reveals the duality between the Q^π-function and the d^π-distribution, thus providing an answer to the question raised by [28]. Although the d-LP provides a mechanism for policy evaluation, directly solving either the primal or dual d-LP is difficult due to the number of constraints, which presents difficulties when the state and action spaces are uncountable. These issues are exacerbated in the off-policy setting, where one only has access to samples (s0, s, a, r, s′) from a stochastic process. To overcome these difficulties, one can instead approach these primal and dual LPs through the Lagrangian,

$$\max_{d}\ \min_{Q}\ L(d, Q) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}[Q(s_0,a_0)] + \textstyle\sum_{s,a} d(s,a)\,\big(R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a) - Q(s,a)\big).$$

In order to enable the use of an arbitrary off-policy distribution d^D, we make the change of variables ζ(s, a) := d(s, a)/d^D(s, a). This yields an equivalent Lagrangian in a more convenient form:

$$\max_{\zeta}\ \min_{Q}\ L_{\mathcal{D}}(\zeta, Q) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}[Q(s_0,a_0)] + \mathbb{E}_{(s,a,r,s')\sim d^{\mathcal{D}},\,a'\sim\pi(s')}\big[\zeta(s,a)\,\big(r + \gamma Q(s',a') - Q(s,a)\big)\big]. \tag{8}$$

The Lagrangian has primal and dual solutions Q* = Q^π and ζ* = d^π/d^D. Approximate solutions to one or both of Q̂, ζ̂ can be used to estimate ρ̂(π), either by using the standard DICE paradigm in (5), which corresponds to the dual objective in (6), or alternatively by using the primal objective in (7) or the Lagrangian objective in (8); we further discuss these choices later in this section.

Although the Lagrangian in (8) should in principle be able to recover the solutions Q^π, d^π and so yield accurate estimates of ρ(π), in practice a number of optimization difficulties are liable to be encountered. Specifically, even in the tabular case, due to lack of curvature, the Lagrangian is not strongly-convex-strongly-concave, and so one cannot guarantee convergence of the final solution under stochastic gradient descent-ascent (SGDA). These optimization issues can become more severe when moving to the continuous case with neural network parametrization, which is the dominant application case in practice. In order to mitigate these issues, we present a number of ways to introduce more stability into the optimization and discuss how these mechanisms may trade off with the bias of the final estimate. We will show that applying certain mechanisms recovers the existing members of the DICE family, while a larger set remains unexplored.
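To make the saddle-point objective concrete, the sketch below estimates L_D(ζ, Q) in (8) from a single minibatch under a tabular parametrization; the batch format and target-policy sampler are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lagrangian_batch(q, zeta, batch, gamma, policy):
    """Monte Carlo estimate of L_D(zeta, Q) in Eq. (8) on one minibatch.

    q, zeta : arrays of shape [num_states, num_actions] (tabular sketch).
    batch   : dict with integer/float arrays 's0', 's', 'a', 'r', 's_next',
              sampled as in Section 2.2 (hypothetical data format).
    policy  : function state -> sampled action under the target policy pi.
    """
    s0, s, a, r, s_next = (batch[k] for k in ('s0', 's', 'a', 'r', 's_next'))
    a0 = np.array([policy(x) for x in s0])          # a_0 ~ pi(s_0)
    a_next = np.array([policy(x) for x in s_next])  # a' ~ pi(s')

    initial_term = (1.0 - gamma) * np.mean(q[s0, a0])
    bellman_diff = r + gamma * q[s_next, a_next] - q[s, a]
    correction_term = np.mean(zeta[s, a] * bellman_diff)
    return initial_term + correction_term

# SGDA would ascend this quantity in zeta and descend it in Q; with neural
# network parametrizations the same expression serves as the saddle-point loss.
```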
3.2 Regularizations and Redundant Constraints

The augmented Lagrangian method (ALM) [25] was proposed exactly to circumvent such optimization instability: strong convexity is introduced by adding extra regularization without changing the optimal solution. However, directly applying ALM, i.e., adding h_p(Q) := ‖B^π Q − Q‖²_{d^D} or h_d(d) := D_f(d ‖ B^π_* d), where D_f denotes an f-divergence, introduces extra difficulty, both statistically and algorithmically, due to the conditional expectation operators in B^π and B^π_* appearing inside the nonlinear functions in h_p(Q) and h_d(d); this is known as the double-sample issue in the RL literature [1]. As a result, vanilla stochastic gradient descent is no longer applicable [5], due to bias in the gradient estimator. In this section, we follow the spirit of ALM but explore other choices of regularization to introduce strong convexity into the original Lagrangian (8). In addition to regularization, we also employ redundant constraints, which add more structure to the optimization without affecting the optimal solutions. We later analyze which of these configurations lead to biased estimates of ρ(π).

We first present the unified objective in full form, equipped with all choices of regularizations and redundant constraints:

$$\max_{\zeta\ge 0}\ \min_{Q,\lambda}\ L_{\mathcal{D}}(\zeta, Q, \lambda) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}[Q(s_0,a_0)] + \lambda \\
\qquad + \mathbb{E}_{(s,a,r,s')\sim d^{\mathcal{D}},\,a'\sim\pi(s')}\big[\zeta(s,a)\,\big(\alpha_R R(s,a) + \gamma Q(s',a') - Q(s,a) - \lambda\big)\big] \\
\qquad + \alpha_Q\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}[f_1(Q(s,a))] - \alpha_\zeta\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}[f_2(\zeta(s,a))]. \tag{9}$$

Now, let us explain each term in (αQ, αζ, αR, ζ ≥ 0, λ).

- Primal and dual regularization: To introduce better curvature into the Lagrangian, we add primal regularization αQ E_{d^D}[f1(Q)] or dual regularization αζ E_{d^D}[f2(ζ)], respectively. Here f1, f2 are convex and lower-semicontinuous functions.
- Reward: Scaling the reward may be seen as an extension of the dual regularizer, as the reward is a component of the dual objective in (6). We consider αR ∈ {0, 1}.
- Positivity: Recall that the solution to the original Lagrangian is ζ*(s, a) = d^π(s, a)/d^D(s, a) ≥ 0. We thus consider adding a positivity constraint on the dual variable. This may be interpreted as modifying the original d-LP in (6) by adding the condition d ≥ 0.
- Normalization: Similarly, the normalization constraint also comes from a property of the optimal solution ζ*(s, a), namely E_{d^D}[ζ*(s, a)] = 1. If we add the extra constraint Σ_{s,a} d(s, a) = 1 to the d-LP (6) and apply the Lagrangian, we obtain the terms λ − E_{d^D}[λ ζ(s, a)] seen in (9).

As we can see, the latter two options come from properties of the optimal dual solution, which suggests that their inclusion does not affect the optimal dual solution. On the other hand, the first two options (primal/dual regularization and reward scaling) will in general affect the solutions of the optimization. Whether a bias in the solution affects the final estimate depends on the estimator being used.

Remark (robust optimization justification): Besides the ALM motivation of strong convexity, the regularization terms in (9), αQ E_{(s,a)∼d^D}[f1(Q(s, a))] and αζ E_{(s,a)∼d^D}[f2(ζ(s, a))], can also be interpreted as introducing robustness via perturbations of the Bellman difference. Consider the dual regularization as an example. The Fenchel dual of αζ E_{(s,a)∼d^D}[f2(ζ(s, a))] gives

$$\max_{\delta(s,a)\in\Omega}\ \Big\{\langle \zeta, \delta\rangle - \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[f_2^*(\delta(s,a))\big]\Big\},$$

where Ω denotes the domain of f_2^*.
By plugging f_2^*(·) = (·)² back into (9), we obtain

$$\max_{\zeta\ge 0}\ \min_{Q,\lambda,\,\delta\in\Omega}\ L_{\mathcal{D}}(\zeta, Q, \lambda, \delta) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}[Q(s_0,a_0)] + \lambda \\
\qquad + \mathbb{E}_{(s,a,r,s')\sim d^{\mathcal{D}},\,a'\sim\pi(s')}\big[\zeta(s,a)\,\big(\alpha_R R(s,a) + \gamma Q(s',a') - Q(s,a) - \lambda - \alpha_\zeta\,\delta(s,a)\big)\big] \\
\qquad + \alpha_Q\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}[f_1(Q(s,a))] + \alpha_\zeta\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}[\delta^2(s,a)], \tag{10}$$

which can be understood as introducing slack variables (i.e., perturbations in an L2-ball) to the Bellman difference αR R(s, a) + γQ(s′, a′) − Q(s, a). Different dual regularizations result in perturbations in different dual spaces. From this perspective, the dual regularization mitigates both the sampling error in approximating the Bellman difference and the approximation error induced by the parametrization of Q. Similarly, the primal regularization can be interpreted as introducing slack variables to the stationary state-action distribution condition. Please refer to Appendix A for details.

Given estimates Q̂, λ̂, ζ̂, there are three potential ways to estimate ρ(π):

- Primal estimator: ρ̂Q(π) := (1 − γ) E_{s0∼µ0, a0∼π(s0)}[Q̂(s0, a0)] + λ̂.
- Dual estimator: ρ̂ζ(π) := E_{(s,a,r)∼d^D}[ζ̂(s, a) r].
- Lagrangian estimator: ρ̂Q,ζ(π) := ρ̂Q(π) + ρ̂ζ(π) + E_{(s,a,r,s′)∼d^D, a′∼π(s′)}[ζ̂(s, a) (γQ̂(s′, a′) − Q̂(s, a) − λ̂)].

The following theorem outlines when a choice of regularizations, redundant constraints, and final estimator provably results in an unbiased estimate of the policy value.

Theorem 2 (Regularization profiling). Under Assumptions 1 and 2, and without considering function approximation, we summarize the effects of (αQ, αζ, αR, ζ ≥ 0, λ), which correspond to primal and dual regularization, with or without the reward, and the positivity and normalization constraints:

| Regularization (with or without λ) | | ρ̂Q | ρ̂ζ | ρ̂Q,ζ |
| --- | --- | --- | --- | --- |
| αQ > 0, αζ = 0 | αR = 1, ζ free | Unbiased | Biased | Unbiased |
| | αR = 1, ζ ≥ 0 | Biased | Biased | Biased |
| | αR = 0, ζ free | Biased | Unbiased | Unbiased |
| | αR = 0, ζ ≥ 0 | Biased | Unbiased | Unbiased |
| αQ = 0, αζ > 0 | αR = 1, ζ free | Biased | Unbiased | Unbiased |
| | αR = 1, ζ ≥ 0 | Biased | Unbiased | Unbiased |
| | αR = 0, ζ free | Biased | Unbiased | Unbiased |
| | αR = 0, ζ ≥ 0 | Biased | Unbiased | Unbiased |

Notice that the primal and dual solutions can both be unbiased under specific regularization configurations, but the dual solution is unbiased in 6 out of the 8 configurations, whereas the primal solution is unbiased in only 1. The primal solution additionally requires the positivity constraint to be excluded (see details in Appendix B), further restricting its optimization choices. The Lagrangian estimator is unbiased when at least one of Q̂, λ̂ or ζ̂ is unbiased; this property is referred to as doubly robust in the literature [12]. This seems to imply that the Lagrangian estimator is optimal for behavior-agnostic off-policy evaluation. However, this is not the case, as we will see in the empirical analysis. Instead, the approximate dual solutions are typically more accurate than the approximate primal solutions. Since neither is exact, the Lagrangian suffers from errors in both, while the dual estimator ρ̂ζ exhibits more robust performance, as it relies solely on the approximate ζ̂.
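As a concrete companion to the three estimator definitions above, the following minimal NumPy sketch computes ρ̂Q, ρ̂ζ, and ρ̂Q,ζ from learned Q̂, λ̂, ζ̂ evaluated on a batch of logged data; the array inputs are assumed for illustration and this is not the paper's implementation.

```python
import numpy as np

def primal_estimate(q_init, lam, gamma):
    """rho_Q(pi) = (1 - gamma) * E[Q_hat(s0, a0)] + lambda_hat,
    with q_init = Q_hat evaluated at (s0, a0 ~ pi(s0))."""
    return (1.0 - gamma) * np.mean(q_init) + lam

def dual_estimate(zeta, rewards):
    """rho_zeta(pi) = E_D[zeta_hat(s, a) * r], as in Eq. (5)."""
    return float(np.mean(zeta * rewards))

def lagrangian_estimate(q_init, q_sa, q_next, zeta, rewards, lam, gamma):
    """rho_{Q,zeta}(pi) = rho_Q + rho_zeta
       + E_D[zeta_hat * (gamma * Q_hat(s', a') - Q_hat(s, a) - lambda_hat)]."""
    correction = np.mean(zeta * (gamma * q_next - q_sa - lam))
    return primal_estimate(q_init, lam, gamma) + dual_estimate(zeta, rewards) + correction

# q_init, q_sa, q_next, zeta, rewards are the learned Q_hat and zeta_hat
# evaluated on (s0, a0 ~ pi), (s, a), (s', a' ~ pi), and the logged rewards.
```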
3.3 Recovering Existing OPE Estimators

This organization provides a complete picture of the DICE family of estimators. Existing DICE estimators can be recovered simply by picking one of the valid regularization configurations:

- DualDICE [21]: (αQ = 0, αζ = 1, αR = 0) without ζ ≥ 0 and without λ. DualDICE also derives an unconstrained primal form where optimization is exclusively over the primal variables (see Appendix D). This form results in a biased estimate but avoids the difficulties of minimax optimization, which is again a tradeoff between optimization stability and solution unbiasedness.
- GenDICE [30] and GradientDICE [31]: (αQ = 1, αζ = 0, αR = 0) with λ. GenDICE differs from GradientDICE in that GenDICE enables ζ ≥ 0 whereas GradientDICE disables it.
- DR-MWQL and MWL [28]: (αQ = 0, αζ = 0, αR = 1) and (αQ = 0, αζ = 0, αR = 0) respectively, both without ζ ≥ 0 and without λ.
- LSTDQ [15]: With linear parametrizations ζ(s, a) = v⊤φ(s, a) and Q(s, a) = w⊤φ(s, a), any unbiased estimator without ζ ≥ 0 and λ in Theorem 2 recovers LSTDQ. Please refer to Appendix C for details.
- AlgaeDICE Q-LP [22]: (αQ = 0, αζ = 1, αR = 1) without ζ ≥ 0 and without λ.
- BestDICE: (αQ = 0, αζ = 1, αR = 0/1) with ζ ≥ 0 and with λ. Notably, this variant, which achieves the best performance, had not been identified prior to this unified framework.

4 Experiments

In this section, we empirically verify the theoretical findings. We evaluate different choices of estimators, regularizers, and constraints on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher), under linear and neural network parametrizations, with offline data collected from behavior policies with different noise levels (π1 and π2). See Appendix F for implementation details and additional results. Our empirical conclusions are as follows:

- The dual estimator ρ̂ζ is unbiased under more configurations, yields the best performance of all estimators, and exhibits strong robustness to scaling and shifting of MDP rewards.
- Dual regularization (αζ > 0) yields better estimates than primal regularization; within the choice of αR ∈ {0, 1}, there is a slight advantage to αR = 1.
- The inclusion of the redundant constraints (λ and ζ ≥ 0) improves stability and estimation performance.
- As expected, optimization using the unconstrained primal form is more stable but also more biased than optimization using the minimax regularized Lagrangian.

Based on these findings, we propose a particular set of choices that generally performs well but was overlooked by previously proposed DICE estimators: the dual estimator ρ̂ζ with regularized dual variable (αζ > 0, αR = 1) and redundant constraints (λ, ζ ≥ 0), optimized with the Lagrangian.

4.1 Choice of Estimator (ρ̂Q, ρ̂ζ, or ρ̂Q,ζ)

We first consider the choice of estimator. In each case, we perform Lagrangian optimization with regularization chosen according to Theorem 2 so as not to bias the resulting estimator. We also use αR = 1 and include the redundant constraints λ and ζ ≥ 0 for the dual estimator. Although not shown, we also evaluated combinations of regularizations that can bias the estimator (as well as no regularization) and found that these generally performed worse; see Section 4.2 for a subset of these experiments.
Figure 1: Estimation results on Grid, Reacher, and Cartpole using data collected from different behavior policies (π2 is closer to the target policy than π1). Biased estimator-regularizer combinations from Theorem 2 are omitted. The dual estimator with regularized dual variable outperforms all other estimators/regularizers. The Lagrangian can be as good as the dual but has larger variance. (Panels plot estimated per-step reward versus training steps for Grid, Reacher, and Cartpole under π1 and π2, comparing the dual, Lagrangian, and primal estimators under dual or primal regularization against the true value; plots not reproduced here.)

Figure 2: Primal (red), dual (blue), and Lagrangian (green) estimates under linear (top) and neural network (bottom) parametrization when rewards are transformed during training (scaled by 0.5, 10, or 100, shifted by 5 or 10, or exponentiated). Estimates are transformed back and plotted on the original scale. The dual estimates are robust to all transformations, whereas the primal and Lagrangian estimates are sensitive to the reward values. (Panels plot estimates versus training steps on Grid, together with upper and lower bounds and the true value; plots not reproduced here.)

Our evaluation of the different estimators is presented in Figure 1. We find that the dual estimator consistently produces the best estimates across different tasks and behavior policies. In comparison, the primal estimates are significantly worse. While the Lagrangian estimator can improve on the primal, it generally exhibits higher variance than the dual estimator. Presumably, the Lagrangian does not benefit from the doubly robust property, since both solutions are biased in this practical setting.

To more extensively evaluate the dual estimator, we investigate its performance when the reward function is scaled by a constant, shifted by a constant, or exponentiated (note that this is separate from αR, which only affects optimization; we use αR = 1 exclusively here). To control for difficulties in optimization, we first parametrize the primal and dual variables as linear functions, and use stochastic gradient descent to solve the convex-concave minimax objective in (9) with αQ = 0, αζ = 1, and αR = 1. Since a linear parametrization changes the ground truth of the evaluation, we compute upper and lower estimation bounds by parameterizing only the primal or only the dual variable as a linear function. Figure 2 (top) shows the estimated per-step reward on the Grid task. When the original reward is used (col. 1), the primal, dual, and Lagrangian estimates eventually converge to roughly the same value (although the primal estimates converge much more slowly). When the reward is scaled by 10 or 100 times, or shifted by 5 or 10 units (the original reward is between 0 and 1), the resulting primal estimates are severely affected and do not converge given the same number of gradient updates.
When performing this same evaluation with neural network parametrization (Figure 2, bottom), the primal estimates continue to exhibit sensitivity to reward transformations, whereas the dual estimates stay roughly the same after being transformed back to the original scale. We further implemented a target network for training stability of the primal variable, and the same conclusion holds (see the Appendix). Note that while the dual solution is robust to the scale and range of rewards, the optimization objective used here still has αR = 1; this is different from αR = 0, in which case ρ̂Q is no longer a valid estimator.

Figure 3: Dual estimates when αR = 0 (dotted lines) and αR = 1 (solid lines) on Grid, Reacher, and Cartpole under π1 and π2. Regularizing the dual variable (blue) is consistently better than regularizing the primal variable (orange); primal regularization with αR = 1 and αζ = 0 leads to biased estimation (solid orange). The value of αR does not affect the final estimate when αζ = 1, αQ = 0. (Panels plot estimated per-step reward versus training steps against the true value; plots not reproduced here.)

4.2 Choice of Regularization (αζ, αR, and αQ)

Next, we study the choice between regularizing the primal or dual variables. Given the results of Section 4.1, we focus on ablations using the dual estimator ρ̂ζ to estimate ρ(π). Results are presented in Figure 3. As expected, we see that regularizing the primal variables when αR = 1 leads to a biased estimate, especially in Grid (π1), Reacher (π2), and Cartpole. Regularizing the dual variable (blue lines), on the other hand, does not incur such a bias. Additionally, the value of αR has little effect on the final estimates when the dual variable is regularized (dotted versus solid blue lines). While the invariance to αR may not generalize to other tasks, an advantage of the dual estimates with regularized dual variable is the flexibility to set αR = 0 or 1 depending on the reward function.
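Putting these choices together, the sketch below assembles the sample-based objective (9) for the configuration the ablations favor (dual regularization with αR = 1, the normalization variable λ, and positivity enforced by squaring the dual parameters, consistent with Section 4.3). The tabular parametrization and the choice f2(ζ) = ζ²/2 are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def regularized_lagrangian(q, nu, lam, batch, gamma, policy, alpha_zeta=1.0, alpha_r=1.0):
    """Sample-based value of L_D(zeta, Q, lambda) in Eq. (9) with alpha_Q = 0,
    alpha_zeta > 0, alpha_R = 1, normalization lambda, and zeta = nu**2 >= 0.

    q, nu : arrays of shape [num_states, num_actions]; zeta = nu**2.
    batch : dict with arrays 's0', 's', 'a', 'r', 's_next' (hypothetical format).
    policy: function state -> sampled action under the target policy pi.
    f_2 is taken to be f_2(zeta) = 0.5 * zeta**2 (an assumed convex choice).
    """
    s0, s, a, r, s_next = (batch[k] for k in ('s0', 's', 'a', 'r', 's_next'))
    a0 = np.array([policy(x) for x in s0])
    a_next = np.array([policy(x) for x in s_next])

    zeta = nu[s, a] ** 2                                   # positivity via squaring
    bellman = alpha_r * r + gamma * q[s_next, a_next] - q[s, a] - lam
    return ((1.0 - gamma) * np.mean(q[s0, a0]) + lam
            + np.mean(zeta * bellman)
            - alpha_zeta * np.mean(0.5 * zeta ** 2))

# Training alternates gradient ascent on nu with descent on (q, lam); the final
# report is the dual estimate, mean(zeta * r), evaluated over the dataset.
```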
Figure 4: The effect of applying the positivity constraint, the normalization constraint, and the unconstrained primal form during optimization (blue curves). The normalization constraint is essential when γ = 1 and also helps when γ < 1 (row 1). The positivity constraint improves training stability (row 2). Solving the unconstrained primal problem (row 3) can be useful when the action space is discrete. (Row 1 compares positivity alone against positivity plus normalization on Reacher and Cartpole with γ = 0.999 and γ = 1.0; row 2 compares normalization alone against normalization plus positivity on Grid, Reacher, and Cartpole; row 3 compares the Lagrangian against the unconstrained primal form. Panels plot estimated per-step reward versus training steps against the true value; plots not reproduced here.)

4.3 Choice of Redundant Constraints (λ and ζ ≥ 0)

So far, our experiments with the dual estimator have used λ and ζ ≥ 0 in the optimization, corresponding to the normalization and positivity constraints in the d-LP. However, these are in principle not necessary when γ < 1, and so we evaluate the effect of removing them. Given the results of the previous sections, we focus our ablations on the dual estimator ρ̂ζ with dual regularization (αζ > 0, αR = 1).

Normalization. We first consider the effect of removing the normalization constraint (λ). Figure 4 (row 1) shows the effect of keeping (blue curves) or removing (red curves) this constraint during training. We see that without it, training becomes less stable and the approximation error increases, even when γ < 1.

Positivity. We next evaluate the effect of removing the positivity constraint ζ ≥ 0, which in our previous experiments was enforced by applying a square function to the output of the dual variable's neural network. Results are presented in Figure 4 (row 2), where we again see that removing this constraint is detrimental to optimization stability and estimator accuracy.

4.4 Choice of Optimization (Lagrangian or Unconstrained Primal Form)

So far, our experiments have used minimax optimization via the Lagrangian to learn the primal and dual variables.
We now consider solving the unconstrained primal form of the d-LP, which Section 3.2 suggests may lead to an easier, but biased, optimization. Figure 4 (row 3) indeed shows that the unconstrained primal form reduces variance on Grid and produces better estimates on Cartpole. Both environments have discrete action spaces. Reacher, on the other hand, has a continuous action space, which creates difficulty when taking the expectation over next-step samples, causing bias in the unconstrained primal form. Given this mixed performance, we generally advocate for the Lagrangian, unless the task is discrete-action and the stochasticity of the dynamics is known to be low.

5 Related work

Off-policy evaluation has long been studied in the RL literature [9, 12, 13, 19, 23, 27]. While some approaches are model-based [10], or work by estimating the value function [8], most rely on importance reweighting to transform the off-policy data distribution into the on-policy target distribution. They often require knowing or estimating the behavior policy, and suffer variance that is exponential in the horizon, both of which limit their applicability. Recently, a series of works were proposed to address these challenges [14, 17, 26]. Among them is the DICE family [21, 30, 31], which performs some form of stationary distribution estimation. The present paper develops a convex duality framework that unifies many of these algorithms, and offers further important insights.

Many OPE algorithms may be understood to correspond to the categories considered here. Naturally, the recent stationary distribution correction algorithms [21, 30, 31] are dual methods. The FQI-style estimator [8] loosely corresponds to our primal estimator. Moreover, Lagrangian-type estimators have also been considered [26, 28], although some are not for the behavior-agnostic setting [26].

Convex duality has been widely used in machine learning, and in RL in particular. In one line of literature, it was used to solve the Bellman equation, whose fixed point is the value function [6, 7, 16]. Here, duality facilitates the derivation of an objective function that can be conveniently approximated by sample averages, so that solving for the fixed point is converted into finding a saddle point. Another line of work, more similar to the present paper, optimizes the Lagrangian of the linear program that characterizes the value function [2, 4, 29]. In contrast to our work, these algorithms typically do not incorporate off-policy correction, but assume the availability of on-policy samples.

6 Conclusion

We have proposed a unified view of off-policy evaluation via the regularized Lagrangian of the d-LP. Under this unification, existing DICE algorithms are recovered by specific (suboptimal) choices of regularizers, (redundant) constraints, and ways to convert optimized solutions to policy values. By systematically studying the mathematical properties and empirical effects of these choices, we have found that the dual estimates (i.e., the policy value in terms of the state-action distribution) offer greater flexibility in incorporating optimization stabilizers while preserving asymptotic unbiasedness, in comparison to the primal estimates (i.e., estimated Q-values). Our study also reveals alternative estimators not previously identified in the literature that exhibit improved performance. Overall, these findings suggest promising new directions of focus for OPE research in the offline setting.
Broader Impact

One of the broader issues in reinforcement learning is reproducibility: it is difficult to make a compelling case that any new algorithm is actually an improvement without a common framework for reliably reproducing past related research results. This paper unifies existing DICE estimators and offers a standard implementation of a number of seemingly distinct algorithms that in fact only differ in regularization. Future researchers and practitioners in this domain can benefit from this unification, where algorithms can be effectively compared in isolation from other artifacts such as neural network architecture. Meanwhile, this research has revealed a number of choices in regularization, redundant constraints, and final estimators in DICE, suggesting a large hyperparameter search space. While we try our best to illustrate which choices matter through mathematical analysis and ablation studies, individuals with limited computational resources could still be put at a disadvantage when tuning these algorithms to achieve the best performance on a specific task.

Acknowledgments

We thank Hanjun Dai and other members of the Google Brain team for helpful discussions.

References

[1] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30-37. Morgan Kaufmann, 1995.
[2] Joan Bas-Serrano and Gergely Neu. Faster saddle-point optimization for solving large-scale Markov decision processes, 2019. arXiv:1909.10904.
[3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[4] Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear π learning using state and action features. In Proceedings of the 32nd International Conference on Machine Learning, 2018.
[5] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.
[6] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the 35th International Conference on Machine Learning, pages 1133-1142, 2018.
[7] Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, pages 1049-1058. JMLR.org, 2017.
[8] Yaqi Duan and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation, 2020. arXiv:2002.09516.
[9] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.
[10] Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383-416, 2013.
[11] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094-2100, 2016.
[12] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[13] Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019.
[14] Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850, 2019.
[15] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107-1149, 2003.
[16] Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, Amsterdam, Netherlands, 2015.
[17] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356-5366, 2018.
[18] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
[19] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054-1062, 2016.
[20] S. Murphy, M. van der Laan, and J. Robins. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410-1423, 2001.
[21] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pages 2315-2325, 2019.
[22] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience, 2019.
[23] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759-766, 2000.
[24] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
[25] R. Tyrrell Rockafellar. Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM Journal on Control, 12(2):268-285, 1974.
[26] Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[27] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the 29th Conference on Artificial Intelligence, 2015.
[28] Masatoshi Uehara and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.
[29] Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear (sometimes sublinear) running time. arXiv preprint arXiv:1704.01869, 2017.
[30] Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020.
[31] Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113, 2020.