# Occupancy-based Policy Gradient: Estimation, Convergence, and Optimality

Audrey Huang, Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL 61820, audreyh5@illinois.edu
Nan Jiang, Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL 61820, nanjiang@illinois.edu

Abstract

Occupancy functions play an instrumental role in reinforcement learning (RL) for guiding exploration, handling distribution shift, and optimizing general objectives beyond the expected return. Yet, computationally efficient policy optimization methods that use (only) occupancy functions are virtually non-existent. In this paper, we establish the theoretical foundations of model-free policy gradient (PG) methods that compute the gradient through the occupancy for both online and offline RL, without modeling value functions. Our algorithms reduce gradient estimation to squared-loss regression and are computationally oracle-efficient. We characterize the sample complexities of both local and global convergence, accounting for both finite-sample estimation error and the roles of exploration (online) and data coverage (offline). Occupancy-based PG naturally handles arbitrary offline data distributions, and, with one-line algorithmic changes, can be adapted to optimize any differentiable objective functional.

1 Introduction

Value-based methods have been the dominant paradigm in model-free reinforcement learning, with a solid theoretical foundation in large state spaces under function approximation [CJ19; JYWJ20a; ZLKB20; JLM21; XJ21; XFBJK22]. In contrast, a model-free RL paradigm based on their natural counterparts, occupancy functions, remains largely under-investigated. Occupancy functions are densities that describe a policy's state visitation, and play instrumental roles in guiding exploration [HKSVS19; AFK24], handling distribution shift [HM17; NCDL19; CJ22], and optimizing general objectives beyond the expected return [ZBWK20; MDSDBR22]. Despite this, they are seldom modeled directly in learning algorithms and appear only in analyses, except in conjunction with value functions in marginalized importance sampling [LLTZ18; NDKCLS19; UHJ20; ZHHJL22; HJ22a]. Recently, [HCJ23] developed algorithms for online and offline RL that model only occupancies via density function classes, spotlighting their roles in handling non-exploratory offline data and in online exploration. However, their focus was on statistical guarantees, and computationally efficient policy optimization for occupancy-based methods remained an open problem.

In response, we develop model-free policy gradient (PG) algorithms that compute the gradient through occupancy functions, without estimating any values. By leveraging a Bellman-like recursion, we reduce occupancy-based gradient estimation to solving a series of squared-loss minimization problems, which can be done in a computationally oracle-efficient manner. Our analysis captures the effects of gradient estimation error, of exploration (in online PG, characterized by the initial state distribution), and of offline data quality (in offline PG) on the sample and iteration complexity required for local and global convergence. In the online setting, our results complement previous works on the optimality of value-based PG [AKLM21; BR24] and extend past their scope to in-
clude general objectives of occupancy functions, such as entropy maximization for pure exploration and risk-sensitive functionals in safe RL [MDSDBR22]. These objectives generally cannot be optimized using value-based policy gradients because they do not admit value functions or Bellman-like equations with which to estimate them [ZBWK20; HDGP24]. In the offline setting, we handle gradient estimation from fixed datasets of poor coverage, which departs from most existing (value-based) off-policy PG estimators that assume an exploratory dataset [KU20; XYWL21; NZJZW22]. Learning with non-exploratory data is a core consideration in recent offline RL [XCJMA21; ZHHJL22], and gives rise to unique challenges in our setting: occupancies are converted into density ratios for learning purposes, but these ratios become unbounded when the data lacks coverage. [HCJ23] used clipping to handle occupancy estimation under poor coverage, which we show is insufficient for gradient estimation (Prop. 4.2). Instead, a novel smooth-clipping mechanism (Sec. 4.2) is developed to provide statistically robust gradient estimates. App. A includes a full discussion of related work, and our contributions are organized as follows: 1. Online PG (Sec. 3) We propose OCCUPG, an occupancy-based PG algorithm that reduces gradient estimation to squared-loss minimization, based on a recursive Bellman flow-like update for the occupancy gradient. We analyze the sample complexities for both local and global convergence, and, notably, our algorithm and analyses extend straightforwardly to the optimization of general objective functionals. 2. Offline PG (Sec. 4) For offline RL we develop and analyze OFF-OCCUPG, which optimizes only the portions of a policy s return that are adequately covered by offline data. Conceptually, our algorithm is based on combining the methods in Sec. 3 with (a smoothed version of) the recursively clipped occupancies from [HCJ23]. As a result, our estimation and convergence guarantees do not require assumptions on data coverage, which relaxes the restrictions of previous works. 2 Preliminaries Finite-horizon Markov decision process (MDP). Finite-horizon MDPs are defined by the tuple M = (S, A, P, R, H, d0), where S is the state space, A is the action space, and H is the horizon. We use [H] = {0, . . . , H} and when clear from the context, use { h} = { h}h [H]. For notational compactness we assume that S = h Sh is the union of H disjoint sets {Sh}, each of which is the set of states reachable at timestep h. This is WLOG as we can always augment the state space with [H] at the cost of only H factors [JKALS17; MBFR24]. Since each state can only be visited at a single timestep, we can now define the (non-stationary) transitions as P : S A (S), and the initial state distribution as d0 (S0). We assume the reward function R : S [0, 1] is bounded on the unit interval and (for simplicity) state-wise deterministic. This sufficiently captures the challenges of our setting since the occupancies are densities over states, and it will be easily seen later that our results generalize to per-state-action rewards. A policy π : S (A) interacting with M observes trajectories {(sh, ah, sh+1, rh+1)}H 1 h=0 , and has expected return J(π) = Eπ[PH h=1 R(sh)]. At any (h, s, a), its expected return-to-go is encoded in the value function Qπ h(s, a) = Eπ[P h >h R(sh )|sh = s, ah = a]. For each h [H], a policy s occupancy function dπ h (S) is a p.d.f. describing its state visitation, dπ h(s) = Pπ(sh = s). 
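To ground this notation, the following is a minimal Monte Carlo sketch of how the occupancies and the expected return could be estimated from on-policy rollouts in a small tabular MDP. The names `sample_trajectory`, `R`, `n_states`, and `H` are hypothetical stand-ins for illustration rather than objects defined in the paper.

```python
import numpy as np

def estimate_occupancy_and_return(sample_trajectory, R, H, n_states, n=10_000):
    """Monte Carlo estimates of the occupancies d^pi_h and the return J(pi).

    `sample_trajectory()` is a hypothetical simulator returning the list of
    states [s_1, ..., s_H] visited in one rollout of pi; `R[s]` is the
    per-state reward. Since d^pi_h(s) = P^pi(s_h = s), the empirical
    visitation frequencies are unbiased estimates of d^pi_h.
    """
    d_hat = np.zeros((H + 1, n_states))       # d_hat[h, s] ~ d^pi_h(s), h = 1..H
    J_hat = 0.0
    for _ in range(n):
        states = sample_trajectory()          # length-H list of visited state ids
        for h, s in enumerate(states, start=1):
            d_hat[h, s] += 1.0 / n
            J_hat += R[s] / n                 # J(pi) = E^pi[sum_h R(s_h)]
    return d_hat, J_hat
```

Since J(π) = Σ_h ⟨d^π_h, R⟩, the same value can also be read off the estimated occupancies, e.g., `(d_hat @ R).sum()` recovers `J_hat`.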
In combination with the policy, the MDP dynamics dictate the evolution of the occupancy over timesteps. This is encoded in the recursive Bellman flow equation, which mandates that dπ h = Pπdπ h 1 for all h [H]. Here, Pπ is the Bellman flow operator with (Pπf)(s ) := P s,a P(s |s, a)π(a|s)f(s) R , for any function f : S R . Policy optimization. For an objective function f : ΠΘ R, the general goal of this work is to find argmaxπθ ΠΘ f(πθ) over a policy class ΠΘ = {πθ : θ Θ, θ Rp, θ B}, parameterized by a convex and closed parameter class Θ with dimension p. One example of f is the expected return J(πθ). Projected gradient ascent (PGA) will be our base algorithm for policy optimization. For a fixed learning rate η and iterations t [T], it iteratively updates θ(t+1) = ProjΘ θ(t) + η f(πθ(t)) . Here, f(πθ) = [ f(πθ) θp ]p [p] Rp is the gradient with respect to θ, where superscript p indexes the p-th entry of a vector. We will assume that the gradient of the policy s log-probability is bounded, as is ubiquitous in the PG literature [LSAB19; AKLM21]. Assumption 2.1. For all πθ ΠΘ, maxs,a log πθ(a|s) G. We will later analyze the convergence rate of our algorithms to stationary points with (approximately) zero gradient, and refer to π(t) = πθ(t) for short. For PGA, stationarity will be measured using the standard gradient mapping Gη(π(t)) with Gη(π(t)) := 1 η(θ(t) θ(t+1)), the parameter change between iterations [Bec17]. Note that if Θ = Rp and no projection is required, then Gη(π(t)) = f(π(t)) reduces to the gradient magnitude. Computational oracles. As is common in the literature, we analyze computational efficiency in terms of the number of calls to the following oracles, which serve as computational abstractions. We desire a polynomial number of such calls in terms of problem-relevant parameters. Given an i.i.d. dataset D = {(x, y)} and function class F, the maximum likelihood estimation oracle outputs argmaxf F ED[log f(x)]. The squared-loss regression oracle finds argminf F ED[(f(x) y)2]. Both can be approximated efficiently whenever optimizing over F is feasible [MHKL20; FR20]. 3 Online Occupancy-based PG We now develop our occupancy-based policy optimization algorithm for the online RL setting, where the policy can continuously interact with the environment to gather new trajectories. Our gradient estimation routine is based on a recursive Bellman flow-like equation that can be approximately solved using squared-loss regression, not unlike those used to estimate occupancy functions in FORC [HCJ23] or value functions in FQI [ASM07]. The intuitions established for our online algorithm form the foundation for our later offline methods. 3.1 Occupancy-based Policy Gradient The expected return of a policy π can be expressed as the expectation over its occupancy of the per-state rewards, J(π) = P sh dπ h(sh)R(sh). The gradient of J(π) then passes through dπ, sh dπ h(sh)R(sh) = P h Esh dπ h [ log dπ h(sh)R(sh)] . We use the grad-log trick above to write J(π) as an expectation over dπ, which makes it amenable to estimation from online samples as long as we can calculate log dπ h : S Rp. We make the key observation that log dπ h can be expressed as a function of log dπ h 1, which involves a time-reversed conditional expectation over the previous timestep s (sh 1, ah 1) given the current sh. Lemma 3.1. 
For any π and h [H], log dπ h satisfies the recursion log dπ h = Eπ h 1 log π + log dπ h 1 , (1) where [Eπ h 1f](s ) := Eπ[f(sh 1, ah 1)|sh = s ] = P s,a P (s |s,a)π(a|s)dπ h 1(s) dπ h(s ) f(s, a)1, for any function f : S A Rp. Further, under Asm. 2.1, maxs,h log dπ h(s) h G. Eq. (1) is derived by propagating the gradient through the Bellman flow equation, and we can solve it from h = 1 to H to compute log dπ h (with log dπ 0 = 0 by definition). While related observations have been made throughout the rich history of PG literature [CC97; MT01; KU20; XYWL21], the expression in Eq. (1) is adapted to our unique pursuit of modeling log dπ with general function approximators. In particular, the conditional expectation (Eπ) immediately hints that log dπ is amenable to estimation using squared-loss regression, a technique that is well-understood for value functions [SB18] and, more recently, for occupancy functions [HCJ23]. Formally, to solve the dynamic programming equation of Eq. (1) in a computationally efficient manner, we reduce it to minimizing a squared-loss regression problem. Consider the standard (supervised learning) regression setup. The solution to argminf E(x,y) Q[(f(x) y)2] maps x 7 EQ[y|x], the conditional expectation given x of the target y under the joint Q. As a result (see Lem. B.2), log dπ h = argming:S Rp Eπ h g(sh) log π(ah 1|sh 1) + log dπ h 1(sh 1) 2i . (2) 1We use the convention 0/0 = 0 for ratios between two functions. Algorithm 1 OCCUPG: Online Occupancy-based Policy Gradient Input: Samples n; iterations T; policy class ΠΘ; gradient function class {Gh}; learning rate η 1: for t = 0, . . . , T 1 do 2: Collect n trajectories with π(t). Set Dreg h = {(sh, ah, sh+1)}n i=1 for all h. Repeat for {Dgrad h }. 3: Initialize g0 = 0. 4: for h = 1, . . . , H do 5: Let L(t) h 1(gh; gh 1) := 1 (s,a,s ) Dreg h 1 gh(s ) log π(t)(a|s) + gh 1(s) 2. Set bg(t) h = argmingh Gh L(t) h 1(gh; bg(t) h 1). (3) 6: end for 7: Estimate b J(π(t)) = 1 (s,a,s ,r ) Dgrad h 1 bg(t) h (s ) r 8: Update θ(t+1) = ProjΘ θ(t) + η b J(π(t)) . Here, g is a vector-valued function, and the norm 2 is equivalent to the sum of p scalar-valued squared-losses for each parameter dimension. The RHS only requires sampling (sh 1, ah 1, sh) π from online rollouts. Then, given finite samples, we can robustly estimate log dπ by minimizing an empirical version of Eq. (2) using regression oracles. 3.2 Online policy gradient algorithm and analyses Alg. 1 (OCCUPG) displays our full online occupancy-based PG procedure. For each iteration t [T], we first collect two independent datasets: {Dreg h } for log dπ(t) estimation, and {Dgrad h } for J(π(t)) estimation. The former occurs in Line 5, where we recursively solve an empirical version of Eq. (2); the latter is computed in Line 7, then used to update the policy (Line 8). Gradient estimation guarantee. In the following, we establish that our regression-based estimation procedure produces accurate estimates of J(π). Our guarantee holds under the requirement that the gradient function classes {Gh} can express the population gradient update (Lem. 3.1) for any target function. It is analogous to the Bellman completeness assumption that is required for regression-based value or occupancy function estimation [CJ19; HCJ23]. Assumption 3.1 (Gradient function class completeness). For all h [H], supg Gh,s S gh(s) h G. Further, for all π ΠΘ, we have Eπ h 1( log π + gh 1) Gh, for all gh 1 Gh 1. 
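To make the reduction concrete, here is a minimal sketch of the occupancy-gradient estimation loop of Alg. 1 (Lines 3-7) under the simplifying assumption of a linear gradient class G_h = {s ↦ Ψ⊤φ(s)}. The feature map `phi`, the score function `grad_log_pi`, and the dataset layout are hypothetical illustration choices; the closed-form least-squares solve stands in for the squared-loss regression oracle.

```python
import numpy as np

def occupg_gradient_estimate(D_reg, D_grad, grad_log_pi, phi, H, p):
    """Sketch of Lines 3-7 of Alg. 1 (OCCUPG) with a linear gradient class.

    Assumed (illustrative) inputs: D_reg[h-1] holds (s, a, s_next) transitions
    and D_grad[h-1] holds (s, a, s_next, r_next) transitions collected with the
    current policy; grad_log_pi(s, a) returns the p-dimensional policy score;
    phi(s) returns k-dimensional features defining G_h = {s -> Psi.T @ phi(s)}.
    """
    g = [lambda s: np.zeros(p)]                        # grad log d^pi_0 = 0
    for h in range(1, H + 1):
        X = np.array([phi(s_next) for (s, a, s_next) in D_reg[h - 1]])
        Y = np.array([grad_log_pi(s, a) + g[h - 1](s)  # regression target of Eq. (3)
                      for (s, a, s_next) in D_reg[h - 1]])
        Psi, *_ = np.linalg.lstsq(X, Y, rcond=None)    # squared-loss regression oracle
        g.append(lambda s, Psi=Psi: Psi.T @ phi(s))    # bind Psi now: this is g_hat_h

    # Line 7: grad_hat J(pi) = (1/n) sum_h sum_{(s,a,s',r')} g_hat_h(s') * r'
    grad_J = np.zeros(p)
    for h in range(1, H + 1):
        n = len(D_grad[h - 1])
        for (s, a, s_next, r_next) in D_grad[h - 1]:
            grad_J += g[h](s_next) * r_next / n
    return grad_J
```

For a nonlinear class G_h, the closed-form solve would simply be replaced by a call to the regression oracle from Sec. 2; the resulting estimate is then fed to the projected gradient ascent update in Line 8.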
Next, since we allow G to be a continuous function class, our sample complexity bound for gradient estimation is expressed in terms of its pseudodimension := d G (Def. F.1). Examples of G parameterizations and their d G are discussed in Rem. 3.1 below. Finally, Thm. 3.1 shows that OCCUPG produces accurate gradient estimates given the following polynomial sample size. Theorem 3.1. Fix δ (0, 1) and π ΠΘ. Under Asm. 2.1 and Asm. 3.1, we have that w.p. 1 δ, J(π) b J(π) ε when n = e O pd GH6G2 log(1/δ) Remark 3.1. Lastly, we provide examples of G for Asm. 3.1 in representative MDP structures. Lowrank MDPs (Def. B.1) are a well-studied setting where the transition function admits a low-rank decomposition into two features of rank k, i.e., there exists ϕ : S A Rk and µ : S Rk such that P(s |s, a) = ϕ(s, a), µ(s ) [JYWJ20b]. Tabular MDPs are a special case with onehot features. Due to the bilinear transitions, both the occupancy and its gradient are linear functions of µ, i.e., dπ = µ(s) ψ and dπ(s) = µ(s) Ψ for some Ψ Rk p, ψ Rk, and all s S. When µ is known, we can set Gh to be a linear-over-linear function class Gh = n gh(s) = µ(s) Ψ µ(s) ψ : Ψ Rk p, ψ Rk, maxs gh(s) h G o , which has d G = kp (Prop. B.1). Stationary convergence. Next, we analyze the convergence rate of OCCUPG to a stationary policy, i.e., one that has near-zero gradient. Note that, in general, stationary policies are not necessarily optimal as the objective function is non-convex. As is standard in the literature, we will assume that the objective has a smooth gradient [LSAB19; AKLM21]. Assumption 3.2 (β-smooth objective). For a function f : ΠΘ R , there exists β > 0 such that f(πθ) f(πθ ) 2 β θ θ 2 for all θ, θ Θ. Cor. 3.1 shows that, in expectation, OCCUPG with T = O(βH/ε) iterations outputs a ε-stationary point, as measured by Gη(π(t)) = 1 η θ(t) θ(t 1) . The proof relies on Thm. 3.1, i.e., with enough samples the statistical noise of the gradient estimates are sufficiently small to enable convergence. Corollary 3.1. Under Asm. 2.1, Asm. 3.1, and Asm. 3.2, the iterates of OCCUPG with T = O(βHε 1) and n = e O(pd GH6G2 log(T/δ)ε 1) satisfiy 1 T PT t=1 E[ Gη(π(t)) 2] ε. Computational efficiency. OCCUPG is not only statistically efficient but computationally oracleefficient as well, since it reduces to a series of squared-loss minimization problems. In each iteration, it makes H calls to a regression oracle to compute the occupancy gradient (Line 5). Then to converge to a ε-stationary point, from Cor. 3.1 we require a total of O(βH2/ε) such calls. Optimality. Lastly, we analyze when the policies recovered by OCCUPG are also approximately optimal. The key inequality is an upper bound on the suboptimality of any policy in terms of its gradient magnitude (or stationarity), and a coverage coefficient Cπ with respect to the optimal policy. Lemma 3.2. For any π and π , define Bπ(π ) := P h,s,a dπ h(s)π (a|s)Qπ h(s, a). Suppose π ΠΘ, 1. (Policy completeness) There exists π+ ΠΘ such that π+ argmaxπ Bπ(π ). 2. (Gradient domination) maxπ ΠΘ Bπ(π ) Bπ(π) m maxθ Θ Bπ(π), θ θ Given ν (S), define the coverage coefficient Cπ := P h dπ h /ν for π = argmaxπ J(π). Then for any πθ ΠΘ, J(π ) J(πθ) m Cπ max θ ΠΘ Jν(πθ), θ θ , (4) where Jν(π) := Es0 ν,π[P h rh] is the expected return of π in M with initial state distribution ν. The lemma preconditions are identical to those required for value-based analysis [BR24]. 
Bπ(π ) is a one-step improvement objective with respect to the occupancies and value functions of π, and we require (1) the policy class to be expressive enough that it contains any maximizer; and (2) the one-step objective to itself have optimality gap upper-bounded by the one-step policy gradient magnitude, for which the constant m is determined wholly by the policy parameterization. For example, the tabular policy πθ(a|s) = θsa has m = 1 [AKLM21]. The coverage coefficient Cπ is the finite-horizon counterpart to the infinite-horizon exploratory initial distribution salient to the analysis of [AKLM21] and [BR24] (which lists developing it as future work). In RL, a small gradient magnitude alone does not guarantee optimality, as it can also occur when the policy rarely visits rewarding states. The coverage coefficient quantifies both how policy performance can suffer from insufficient exploration, as well as how exploratory initializations mitigates this problem. Finally, combining Lem. 3.2 with the stationary convergence result in Cor. 3.1 shows that, on average, the best-iterate of OCCUPG is near-optimal. Corollary 3.2. Under the preconditions of Lem. 3.2 and Cor. 3.1, running OCCUPG2 with initial distribution ν satisfies E[mint J(π ) J(π(t))] ε when T = e O βB2(Cπ )2m2H2 e O B2(Cπ )2m2pd GH6G2 log(T ) 3.3 Optimization of general functionals One standout feature of OCCUPG is that it can, with a one-line change, be adapted for policy optimization of any (differentiable) objective function involving occupancies. We work with JF (π) = P h Fh(dπ h) as a representative formula, where Fh : (S) R is a general functional. Such objectives often evade value-based PG optimization because they do not admit value functions or Bellman-like recursions with which to compute them. Examples include entropy maximization 2This means that trajectories are generated by first sampling the initial state s1 ν, then rolling out the policy according to the true MDP s dynamics. where Fh(d) = d, log d ; imitation learning where Fh(d) = d dπE h 2 2 for an expert policy πE; and the expected return with Fh(d) = d, R [MDSDBR22]. The policy gradient is then JF (π) = P d(s) |d=dπ h log dπ h(s) i . Implementationwise, we need only change Line 7 in OCCUPG to accommodate the new gradient formula, to b JF (π) = 1 n P s Dh bgπ h(s) Fh(d) d(s) |d= b dπ h . The partial derivative of Fh is evaluated with a plug-in occupancy estimate bdπ that can be obtained using maximum likelihood estimation (App. D). Notably, the occupancy gradient estimation module for bgπ h log dπ h (Line 5) is reused verbatim. Given their resemblance to those in Sec. 3.2, the full algorithm and analyses are deferred to App. B.5. 4 Offline Occupancy-based PG In this section, we develop an algorithm for occupancy-based policy optimization in the offline setting, where only fixed datasets are available for learning. A direct modification of OCCUPG, e.g., by converting occupancies to density ratios over the offline data distribution, will fail unless the data covers all possible policies, otherwise the density ratio may be unbounded. In-line with recent state-of-the-art offline RL algorithms, our goal is to establish an offline PG algorithm that adapts to and retains meaningful guarantees under arbitrary offline datasets, for which our key consideration is establishing an offline gradient estimation method. We begin by defining these offline datasets. Definition 4.1. The offline dataset is D = {Dh}, where Dh = {(sh, ah, sh+1, rh+1))}n i=1 is generated i.i.d. 
as sh d D h for some d D h (S) and ah πD h ( |sh) in M, for a known behavior policy πD h . The marginal next-state distribution in Dh is denoted as d D, h (sh+1). Def. 4.1 is more general than the typical i.i.d. trajectory setting [KU20; NZJZW22], where d D h = d D, h 1. Crucially, unlike previous works that require lower-bounded d D or all-policy coverage [KU20; XYWL21; NZJZW22], we will make no assumptions about the quality of D with respect to ΠΘ. Additional notation. For short, we say EDh[ ] E(sh,ah,sh+1,rh+1) Dh[ ], and use (s, a, s , r ) Dh when clear from the context. For any g : S A Rp and reweighting function ρ : H S A R+, we define an offline reweighted analog to Eπ h (Lem. 3.1) for all h [H] to be [ED,ρ h g](s ) := E(s,a,s ) Dh ρh[g(s, a)|s ] = P s,a [Dh ρh](s,a,s ) P s,a[Dh ρh](s,a,s ) g(s, a). (5) The (time-reversed) conditional expectation is taken over [Dh ρh](s, a, s ) := P(s |s, a)d D h (s)πD h (a|s)ρh(s, a), the joint offline distribution re-weighted by ρh. While this may not be a valid density, its induced conditional distribution on (s, a|s ) always is, i.e., P s,a [Dh ρh](s,a,s ) P s,a[Dh ρh](s,a,s ) = 1. As an example, for a given π we have ED,ρ h = Eπ h when ρh(s, a) = dπ h(s)π(a|s) d D h (s)πD h (a|s) is the policy s density ratio and is well-defined. 4.1 Offline density-based policy gradient A policy s occupancy dπ may not be covered by arbitrary offline data (Def. 4.1), so neither its expected return J(π) = P h dπ h, R nor its gradient J(π) will be estimatable from D. As a result, there is no hope of recovering argmaxπ ΠΘ J(π). Our solution is to instead maximize return only on areas of the state space that are sufficiently covered by offline data, which is captured exactly by the recursively clipped occupancy dπ from [HCJ23]. It clamps the policy occupancy to preset multiples Cs h, Ca h of the offline data distribution, thereby representing only the sufficiently covered portion. Definition 4.2 (Recursively clipped occupancy). Let ( ) := min{ , }. Given clipping constants {Cs h, Ca h} 1, define the clipped policy to be πh = π Ca hπD h , and recursively define dπ h = P πh 1 dπ h 1 Cs h 1d D h 1 , h [H]. (6) Eq. (6) resembles the Bellman flow equation with clipped policy π, and acts on the previous-timestep dπ h 1 clipped to at most Cs h 1d D h 1. Above this threshold the occupancy is considered to be insufficiently covered for estimation, and Cs strikes a bias-variance tradeoff between the amount of clipped mass vs. distribution shift. The clipped occupancy s density ratio is always well-defined and bounded as dπ h/d D, h 1 Cs h 1Ca h 1, and we use it to define our (now learnable) offline objective, sh dπ h(sh)R(sh) = P dπ h(sh) d D, h 1(sh)R(sh) . For any fully covered policy with dπ h Cs h 1Ca h 1d D, h 1 for all h [H], we have dπ = dπ and J(π) = J(π). In this sense, argmaxπ J(π) will be at least as good as the best policy fully covered by offline data. Next, define the density ratio be wπ h := dπ h/d D, h 1. The gradient of J(π) is h EDh 1 wπ h(sh)R(sh) log dπ h(sh) . To calculate this gradient we must compute both wπ and log dπ; for the former, [HCJ23] provides a method that we will later call as a subroutine. Our focus is on computing log dπ h, which is enabled by the following recursive equation, which is an offline analog of Lem. 3.1. Lemma 4.1. For any π and all h [H], define ρπ h(s, a) := ( dπ h(s) Cs hd D h (s)) d D h (s) πh(a|s) πD h (a|s). 
Then log dπ h = ED, ρπ h 1 log π 1[π Ca hπD h 1] + log dπ h 1 1[ dπ h 1 Cs hd D h 1] , (7) where ED, ρπ h 1 is from Eq. (5), and [M v]( ) := v( )M( ) Rp for M : p and v : R. Lem. 4.1 is derived from applying the chain rule to Def. 4.2, and the clipped occupancies play an instrumental role in handling insufficient offline coverage. Notably, the indicator function zeroes-out both the gradients log π and log dπ h 1 where they are insufficiently covered, e.g., dπ h 1(s) > Cs h 1d D h 1(s). Further, under full offline coverage we recover Lem. 3.1 and log dπ = log dπ. Because the rewards are nonnegative, log dπ induces a pessimistic policy gradient that shifts policies away from out-of-distribution actions, even if they generate high return. This is seen more clearly in Prop. 4.1, that rearranges the resulting expression for J(π) into a value-based form: Proposition 4.1. We can equivalently write h EDh[ ρπ h(s, a) log πh(a|s) Qπ h(s, a)], where Qπ is a pessimistic value function that obeys the Bellman-like recursion Qπ h(s, a) = 1[π Ca hπD h ](a|s) P s P(s |s, a) R(s ) + 1[ dπ h+1 Cs h+1d D h+1](s ) Qπ h+1(s , πh+1) . In Qπ, future returns are zeroed out at states and actions that exceed the threshold of data coverage, due to indicators functions that are inherited from log dπ. Prop. 4.1 can be seen as a pessimistic offline analog to the classical PG theorem J(π) = P h Es,a dπ h [ log π(a|s)Qπ h(s, a)] [SMSM99], entirely induced by the definition of the clipped occupancy. Non-robustness of log dπ estimation to plug-in densities. With finite samples, however, it turns out that consistent estimates of log dπ h in Eq. (6) cannot be computed. To make this argument, we first outline the high-level gradient estimation procedure for a fixed policy: Estimate occupancies {bdπ h} and {bd D h } Compute b log dπ h using Eq. (7) with plug-in indicator function estimate 1[bdπ h 1 Cs h bd D h 1] The problem arises in step two, as 1[ ] is a stepwise function and not smooth. Even if bdπ is vanishingly close to dπ, the gradient calculated from plug-in occupancy estimates can have constant error. Proposition 4.2. There exists an MDP and policy π such that, for any ε > 0, maxh,s log dπ h(s) b log dπ h(s) = O(1) when dπ h bdπ h 1 ε and bd D h d D h 1 ε for all h. 4.2 Smooth clipping To resolve this issue, we will use a smooth-clipping function σ (x, c) to approximate the hard - clipping (x c) in Eq. (6), whose non-smooth gradient was the source of our estimation problems. Figure 1 plots 1-D examples of σ (x, c) against (x c) as reference (dashed), and Asm. 4.1 describes the properties of σ that enable our later estimation and convergence guarantees. 1.0 (x, c = 0.5) 1/2 1/4 1/8 1/16 0 I (x, c = 0.5) 2 4 8 16 Figure 1: We plot σ(x, c) from Prop. 4.3 for different b, that trade-off between clipping approximation error and smoothness (Dσ 1/Lσ). Assumption 4.1. Assume that σ satisfies x, x , c, c dom(σ), 1. (Approximate clipping) Dσ 0 such that 0 (x c) σ (x, c) Dσ (x c). 2. (Monotonicity) σ (x , c) σ (x, c) if x x; σ (x, c ) σ (x, c) if c c; and vice versa. 3. (Smooth gradient) Define the smoothed indicator 1 (x, c) := x x log σ (x, c), where x is the partial derivative w.r.t. x. Then 1 (x, c) [0, 1] and Lσ 0 s.t. x, x , c, c dom(σ), c| 1 (x, c) 1 (x , c) | Lσ|x x |, and x| 1 (x, c) 1 (x, c ) | Lσ|c c |. Note that σ (x, c) = (x c) is a special case with 1 (x, c) = 1[x c], thus Dσ = 0 and Lσ = . The following choice of σ, which is plotted in Fig. 1, fulfills Asm. 4.1. Proposition 4.3. 
For any b > 1, σ (x, c) = x b + c b 1/b has Lσ = b and Dσ = 1/b. Next, we define the smooth-clipped occupancy function edπ h, which is no larger than dπ h. Definition 4.3 (Recursively smooth-clipped occupancy). For smooth-clipping function σ satisfying Asm. 4.1 and clipping constants {Cs h, Ca h}, define eπh := σ π, Ca hπD h , and inductively set edπ h = Peπh 1 σ edπ h 1, Cs h 1d D h 1 , h [H]. (8) Then letting ewπ h := edπ h/d D, h 1, our new objective is e J(π) = P h EDh 1[ ewπ h(sh)R(sh)] with gradient e J(π) = P h EDh 1[ ewπ h(sh)R(sh) log edπ h(sh)], where log edπ h obeys the following recursion. Lemma 4.2. For σ satisfying Asm. 4.1, recall 1 (x, c) := x x log σ (x, c) . Then for all h [H], log edπ h = ED,eρπ h 1 log π 1 π, Ca h 1πD h 1 + log edπ h 1 1 edπ h 1, Cs h 1d D h 1 , (9) where eρπ h 1(s, a) := σ( e dπ h 1(s),Cs h 1d D h 1(s)) d D h 1(s) eπh 1(a|s) πD h 1(a|s) and ED,eρπ h 1 is defined in Eq. (5). Further, under Asm. 2.1, maxs,h log edπ h(s) h G. Eq. (9) replaces the (non-smooth) indicator function in log dπ (Lem. 4.1) with its smooth approximation 1, which, as we will show shortly, enables robust gradient estimation with plug-in occupancy estimates. As before, we can reduce it to squared-loss regression (Eq. (11)). Further, by optimizing e J(π), we also approximately maximize our target objective J(π), with bias proportional to Dσ. Proposition 4.4. Under Asm. 4.1, 0 maxπ ΠΘ J(π) maxπ ΠΘ e J(π) H2Dσ. 4.3 Offline smooth-clipped gradient estimation Alg. 2 describes the offline PG algorithm for optimizing e J(π). To reduce clutter, we have used log eπh := log π 1 π, Ca hπD h . First, OFF-OCCUPG estimates d D h 1 using MLE (details in App. D due to space constraints). Then, for each iteration t, it estimates the smooth-clipped occupancy edπ(t) h using FORC (adapted from [HCJ23], see App. E). This is plugged into a squaredloss regression problem approximating Eq. (9) to learn log ed(t) h (lines 8 to 10), then estimate e J(π(t)) (line 12). Algorithm 2 OFF-OCCUPG: Offline Occupancy-based Policy Gradient Input: data D; iters T; learning rate η; function classes ΠΘ, F, W, G; clipping constants {Cs h Ca h} 1: Split D equally into Dmle, DFORC, Dreg, Dgrad, each with n samples. 2: Estimate {bd D h , bd D, h } MLE Dmle, F // Alg. 4 3: for t = 0, . . . , T 1 do 4: Estimate { bw(t) h } FORC π(t), DFORC, W, {bd D h , bd D, h } 5: Set occupancy estimate bd(t) h = bw(t) h bd D, h 1 for all h [H]. 6: Initialize bg(t) 0 = 0. 7: for h = 1, . . . , H do 8: Set density ratio bρ(t) h 1 = eπ(t) h 1 πD h 1 σ b d(t) h 1,Cs h 1 b d D h 1 b d D h 1 . 9: Set gradient regression target by(t) h 1 = bg(t) h 1 1 bd(t) h 1, Cs h 1 bd D h 1 . 10: Let e L(t) h 1(g; y, ρ) := 1 (s,a,s ) Dreg h 1 ρ(s, a) g(s ) ( log eπ(t) h 1(a|s)+y(s)) 2. Solve bg(t) h = argmingh Gh e L(t) h 1(gh; by(t) h 1, bρ(t) h 1) (10) 11: end for 12: Set b e J(π(t)) = 1 (s,a,s ,r ) Dgrad h bw(t) h (s ) bg(t) h (s ) r 13: Update θ(t+1) = Projθ(θ(t) + η b e J(π(t))). 14: end for Before stating the estimation guarantee for e J(π), we first introduce the required assumptions. For simplicity, we assume that the function classes used in MLE and FORC are finite, and defer their guarantees to the respective appendices, as they have been well-established in previous papers [AKKS20; HCJ23]. We focus on discussing Asm. 4.2 for the offline gradient function class, which requires a stronger level of expressiveness. 
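For reference, below is a minimal numerical sketch of the smooth-clipping function from Prop. 4.3 together with its smoothed indicator 1_σ. The closed form 1_σ(x, c) = 1/(1 + (x/c)^b) follows from differentiating log σ, and b = 8 is an arbitrary illustrative choice (recall that D_σ = 1/b and L_σ = b trade off clipping bias against smoothness).

```python
import numpy as np

def smooth_clip(x, c, b=8.0):
    """Prop. 4.3: sigma_b(x, c) = (x^{-b} + c^{-b})^{-1/b}.
    Lower-bounds min(x, c) and approaches it as b grows (D_sigma = 1/b)."""
    return (x ** (-b) + c ** (-b)) ** (-1.0 / b)

def smoothed_indicator(x, c, b=8.0):
    """1_sigma(x, c) = x * d/dx log sigma_b(x, c) = 1 / (1 + (x / c)^b),
    a smooth, [0, 1]-valued surrogate for the hard indicator 1[x <= c]."""
    return 1.0 / (1.0 + (x / c) ** b)
```

In Alg. 2, these two maps replace the hard clip and the indicator functions from Sec. 4.1.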
Since the regression target in OFF-OCCUPG involves plug-in occupancy estimates, the completeness condition naturally requires Gh to express the gradient update in Lem. 4.2 for all possible targets composed of functions from Fh 1, Wh 1, Gh 1. As a result, Asm. 4.2 is generally stronger than Asm. 3.1 for OCCUPG. Assumption 4.2. For all h, supg Gh gh h G; and for all (π, g, f, f , w) ΠΘ Gh 1 Fh Fh 1 Wh 1, we have ED,ρ h 1( log eπh 1 + g 1 wf , Cs h 1f ) Gh, where ρ = σ(wf ,Cs hf) f eπh 1 πD h 1 . When the underlying MDP has favorable structure, however, we can expect that d G is not much larger than was required for OCCUPG. This is indeed the case in low-rank MDPs, where the G defined in Rem. 3.1 also satisfies Asm. 4.2 (proof in Prop. C.1). Due to the bilinear transition structure, the offline gradient update (Lem. 4.2) applied to any target remains a linear-over-linear function. The guarantees for the MLE (Alg. 4) and weight estimation (Alg. 5) subroutines require Asm. D.1 and Asm. E.1, respectively, which are included in the preconditions of the main result below. Briefly, Asm. D.1 requires F to realize the true data distributions d D h and d D, h , which is standard in supervised learning. Asm. E.1 requires W to be closed under the Bellman flow operator, and can be viewed as a 1-dimensional version of Asm. 4.2 where ρ = 1. In this sense both assumptions are weaker requirements on expressivity than that of the gradient class in Asm. 4.2, and more detailed discussions are left to App. D and App. E. Having established its preconditions, we now present our main estimation guarantee for OFFOCCUPG, which pays additional factors for the coverage of offline data (P h Cs h Ca h) and the smoothness of σ. Theorem 4.1. Suppose e J( ) satisfies Asm. 3.2 and fix π ΠΘ. Under Asm. 2.1, Asm. 4.1, Asm. 4.2, Asm. D.1, and Asm. E.1, w.p. 1 δ we have e J(π) b e J(π) ε when n = e O pd GH6G2( P h Cs h Ca h) 2L2 σ log(|W||F|/δ) ε2 Stationary convergence & computational efficiency. Similar to OCCUPG, OFF-OCCUPG with T = O(βH2/ε2) converges to an ε-stationary point. The formal statement is given in Cor. C.1 and is based on the estimation guarantee in Thm. 4.1. As a result, OFF-OCCUPG is also computationally oracle-efficient. Each invocation of MLE involves 2H calls to a likelihood maximization oracle (see Alg. 4), and each invocation of FORC requires H calls to a squared-loss regression oracle (see Alg. 5). Then local convergence is still achieved with O(βH2/ε2) such calls, as increasing T further cannot reduce error from statistical noise (that depends only on the fixed n). Optimality. Analyzing the conditions under which offline PG recovers global optima is more challenging, as we can no longer utilize exploratory initialization (from Cor. 3.2). However, since all occupancies have been clipped to the data distribution, we show in App. C.5 that the offline data itself can sometimes suffice as an exploratory initial distribution, and the corresponding bound is in terms of {Cs h} (instead of the online Cπ ). However, this is not guaranteed in general and our current result only holds under strong all-policy offline data coverage. Briefly, some hardness comes from the fact that clipping causes gradient signals to vanish, so a stationary policy might be far offsupport, rather than optimal. Investigating the possibility of more relaxed conditions for offline PG convergence (or, conversely, refining hardness results) are especially interesting directions for future work. 
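Before moving to the conclusion, and purely for concreteness, here is a minimal sketch of the reweighted regression step at the core of OFF-OCCUPG (Lines 8-10 of Alg. 2) for a single timestep h, again under a hypothetical linear gradient class. The names `phi`, `pi`, `pi_D`, `d_prev`, `dD_prev`, and `g_prev` are illustrative stand-ins for the plug-in estimates produced by MLE and FORC, σ is instantiated as in Prop. 4.3, and the weighted least-squares solve plays the role of the regression oracle.

```python
import numpy as np

def off_occupg_regression_step(D_reg_hm1, g_prev, grad_log_pi, pi, pi_D,
                               d_prev, dD_prev, Cs, Ca, phi, p, b=8.0):
    """Sketch of Lines 8-10 of Alg. 2 for one step h: build the reweighting
    rho_hat and the smooth-clipped regression target, then solve Eq. (10).

    Illustrative (assumed) inputs: D_reg_hm1 holds (s, a, s_next) tuples from
    timestep h-1; pi(a, s) and pi_D(a, s) are action probabilities; d_prev(s)
    and dD_prev(s) are plug-in estimates of d^pi_{h-1} and d^D_{h-1}; g_prev(s)
    is the previous gradient estimate; phi(s) are features for a linear G_h.
    """
    sig = lambda x, c: (x ** (-b) + c ** (-b)) ** (-1.0 / b)   # smooth clip (Prop. 4.3)
    ind = lambda x, c: 1.0 / (1.0 + (x / c) ** b)              # smoothed indicator

    weights, X, Y = [], [], []
    for (s, a, s_next) in D_reg_hm1:
        # Line 8: rho_hat = [sigma(pi, Ca*pi_D)/pi_D] * [sigma(d_prev, Cs*dD_prev)/dD_prev]
        rho = (sig(pi(a, s), Ca * pi_D(a, s)) / pi_D(a, s)
               * sig(d_prev(s), Cs * dD_prev(s)) / dD_prev(s))
        # Line 9 and the Eq. (10) target: grad log pi_tilde + g_prev * 1_sigma(d_prev, Cs*dD_prev)
        target = (grad_log_pi(s, a) * ind(pi(a, s), Ca * pi_D(a, s))
                  + g_prev(s) * ind(d_prev(s), Cs * dD_prev(s)))
        weights.append(np.sqrt(rho))
        X.append(phi(s_next))
        Y.append(target)

    W = np.array(weights)[:, None]
    # Weighted squared-loss regression: solved in closed form for the linear class
    Psi, *_ = np.linalg.lstsq(W * np.array(X), W * np.array(Y), rcond=None)
    return lambda s: Psi.T @ phi(s)                            # g_hat_h
```

Iterating this over h = 1, ..., H and reweighting the rewards by the estimated clipped density ratios (Line 12) then yields the offline gradient estimate used in the projected ascent update.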
5 Conclusion For the first time, we demonstrate how policy optimization can be conducted with (only) occupancy functions for both online and offline RL, and comprehensively analyze both local and global convergence. In the online setting our method directly extends to optimizing general objective functionals that cannot be optimized using value-based methods, and in the offline setting the occupancy-based gradient naturally handles incomplete offline data coverage. As our work is the first in this line of research and theoretical in nature, for future work we plan to launch empirical investigations of our methods, especially those for optimizing general functionals. Additionally, the conditions under which offline PG can converge to global optima is not well-understood, and we hope that our preliminary results here encourage greater interest and investigation into this question. Acknowledgements Nan Jiang acknowledges funding support from NSF IIS-2112471, NSF CAREER IIS-2141781, Google Scholar Award, and Sloan Fellowship. [AFK24] Philip Amortila, Dylan J Foster, and Akshay Krishnamurthy. Scalable Online Exploration via Coverability . In: ar Xiv preprint ar Xiv:2403.06571 (2024). [AKKS20] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps . In: Advances in Neural Information Processing Systems (2020). [AKLM21] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift . In: The Journal of Machine Learning Research 22.1 (2021), pp. 4431 4506. [ASM07] András Antos, Csaba Szepesvári, and Rémi Munos. Fitted Q-iteration in continuous action-space MDPs . In: Advances in neural information processing systems 20 (2007). [Bec17] Amir Beck. First-order methods in optimization. SIAM, 2017. [BFH23] Anas Barakat, Ilyas Fatkhullin, and Niao He. Reinforcement learning with general utilities: Simpler variance reduction and large state-action space . In: International Conference on Machine Learning. PMLR. 2023, pp. 1753 1800. [BR24] Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods . In: Operations Research (2024). [CC97] Xi-Ren Cao and Han-Fu Chen. Perturbation realization, potentials, and sensitivity analysis of Markov processes . In: IEEE Transactions on Automatic Control 42.10 (1997), pp. 1382 1393. [CJ19] Jinglin Chen and Nan Jiang. Information-Theoretic Considerations in Batch Reinforcement Learning . In: International Conference on Machine Learning. 2019. [CJ22] Jinglin Chen and Nan Jiang. Offline reinforcement learning under value and density-ratio realizability: the power of gaps . In: Uncertainty in Artificial Intelligence. PMLR. 2022, pp. 378 388. [DWS12] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic . In: ar Xiv preprint ar Xiv:1205.4839 (2012). [FR20] Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles . In: International Conference on Machine Learning. PMLR. 2020, pp. 3199 3210. [GL16] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming . In: Mathematical Programming 156.1-2 (2016), pp. 59 99. [HCJ23] Audrey Huang, Jinglin Chen, and Nan Jiang. Reinforcement Learning in Low Rank MDPs with Density Features . In: ar Xiv preprint ar Xiv:2302.02252 (2023). [HDGP24] Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. 
On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes . In: Advances in Neural Information Processing Systems 36 (2024). [HJ22a] Audrey Huang and Nan Jiang. Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions . In: Advances in Neural Information Processing Systems. 2022. [HJ22b] Jiawei Huang and Nan Jiang. On the convergence rate of off-policy policy optimization methods with density-ratio correction . In: International Conference on Artificial Intelligence and Statistics. PMLR. 2022, pp. 2658 2705. [HKSVS19] Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration . In: International Conference on Machine Learning. PMLR. 2019, pp. 2681 2691. [HM17] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation . In: International Conference on Machine Learning. PMLR. 2017, pp. 1372 1383. [JKALS17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAClearnable . In: International Conference on Machine Learning. 2017. [JLM21] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman Eluder dimension: New rich classes of RL problems, and sample-efficient algorithms . In: Advances in Neural Information Processing Systems. 2021. [JYWJ20a] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation . In: Conference on Learning Theory. 2020. [JYWJ20b] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation . In: Conference on learning theory. PMLR. 2020, pp. 2137 2143. [KJDC24] Pulkit Katdare, Anant Joshi, and Katherine Driggs-Campbell. Towards Provable Log Density Policy Gradient . In: ar Xiv preprint ar Xiv:2403.01605 (2024). [KU20] Nathan Kallus and Masatoshi Uehara. Statistically efficient off-policy policy gradients . In: International Conference on Machine Learning. PMLR. 2020, pp. 5089 5100. [LLTZ18] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation . In: Advances in neural information processing systems 31 (2018). [LNSJ23] Qinghua Liu, Praneeth Netrapalli, Csaba Szepesvari, and Chi Jin. Optimistic mle: A generic model-based algorithm for partially observable sequential decision making . In: Proceedings of the 55th Annual ACM Symposium on Theory of Computing. 2023, pp. 363 376. [LSAB19] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Offpolicy policy gradient with state distribution correction . In: ar Xiv preprint ar Xiv:1904.08473 (2019). [MBFR24] Zak Mhammedi, Adam Block, Dylan J Foster, and Alexander Rakhlin. Efficient model-free exploration in low-rank mdps . In: Advances in Neural Information Processing Systems 36 (2024). [MDSDBR22] Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, and Marcello Restelli. Challenging common assumptions in convex reinforcement learning . In: Advances in Neural Information Processing Systems 35 (2022), pp. 4489 4502. [MHKL20] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning . In: International conference on machine learning. 2020. [MT01] Peter Marbach and John N Tsitsiklis. Simulation-based optimization of Markov reward processes . 
In: IEEE Transactions on Automatic Control 46.2 (2001), pp. 191 209. [NCDL19] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections . In: Advances in neural information processing systems 32 (2019). [NDKCLS19] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience . In: ar Xiv preprint ar Xiv:1912.02074 (2019). [NZJZW22] Chengzhuo Ni, Ruiqi Zhang, Xiang Ji, Xuezhou Zhang, and Mengdi Wang. Optimal Estimation of Policy Gradient via Double Fitted Iteration . In: International Conference on Machine Learning. PMLR. 2022, pp. 16724 16783. [SB18] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. [SMSM99] Richard S Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation . In: Advances in neural information processing systems 12 (1999). [UHJ20] Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation . In: International Conference on Machine Learning. PMLR. 2020, pp. 9659 9668. [UIJKSX21] Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency . In: ar Xiv preprint ar Xiv:2102.02981 (2021). [XCJMA21] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning . In: Advances in neural information processing systems 34 (2021). [XFBJK22] Tengyang Xie, Dylan J Foster, Yu Bai, Nan Jiang, and Sham M Kakade. The role of coverage in online reinforcement learning . In: ar Xiv preprint ar Xiv:2210.04157 (2022). [XJ21] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability . In: International Conference on Machine Learning. 2021. [XYWL21] Tengyu Xu, Zhuoran Yang, Zhaoran Wang, and Yingbin Liang. Doubly robust off-policy actor-critic: Convergence and optimality . In: International Conference on Machine Learning. PMLR. 2021, pp. 11581 11591. [ZBWK20] Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, and Alec Koppel. Cautious reinforcement learning via distributional risk in the dual domain . In: ar Xiv preprint ar Xiv:2002.12475 (2020). [ZHHJL22] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. Offline reinforcement learning with realizability and single-policy concentrability . In: Conference on Learning Theory. PMLR. 2022, pp. 2730 2775. [ZLKB20] Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration . In: Advances in Neural Information Processing Systems. 2020. A Related work 14 B Additional results and proofs for Sec. 3 15 B.1 Proofs for Sec. 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.2 Proofs for Sec. 3.2 : Estimation and local convergence . . . . . . . . . . . . . . . 16 B.3 Proofs for Sec. 3.2: Global convergence . . . . . . . . . . . . . . . . . . . . . . . 17 B.4 Examples of gradient function class G . . . . . . . . . . . . . . . . . . . . . . . . 19 B.5 Policy optimization of general functionals . . . . . . . . . . . . . . . . . . . . . . 20 C Additional results and proofs for Sec. 4 21 C.1 Proofs for Sec. 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . 21 C.2 Proofs for Sec. 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.3 Proofs for Sec. 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 C.4 Local convergence of OFF-OCCUPG . . . . . . . . . . . . . . . . . . . . . . . . . 30 C.5 Global convergence of OFF-OCCUPG . . . . . . . . . . . . . . . . . . . . . . . . 32 C.6 Proofs for App. C.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 D Maximum Likelihood Estimation 38 E Offline Density Estimation 39 F Probabilistic Tools 41 G Optimization Tools 43 A Related work In this section, we discuss related works in greater detail that concern the convergence and estimation of policy gradient in RL. While a handful of recent papers have similarly observed that the gradient of the log density can be utilized to compute the policy gradient, especially in the context of using it to optimize general functionals, none of them have analyzed methods that are sample-efficient under general function approximation. In particular, [KJDC24] requires on-policy sampling from the time-reversed transition (s, a|s ), which, as they note, is highly restrictive. To overcome this issue they propose a min-max algorithm that converges under linear approximation, which is computationally a far more difficult to solve (under a more stringent structural assumption) than the regression objective in Alg. 1. Similarly, [BFH23] consider only online policy gradient, and to handle large state spaces they use linear function approximation, which may incur a large error through bias in many settings. [ZBWK20] approach the problem of optimizing risk functionals through a primal-dual approach that involves occupancies as dual variables, but they only analyze convergence in tabular settings. A number of works on off-policy gradient optimization utilize all three of the density ratio, value, and policy class functions to compute the gradient [NDKCLS19; HJ22b; UIJKSX21; XYWL21]. This is because the density ratios are required to handle distribution mismatches with offline data. The downside, however, is that even max-min optimization is difficult, so performing optimization over all three functions requires complex optimization loops. By using simply projected gradient ascent on a policy class, our algorithms avoid such complexities and are amenable to classical convergence analysis that allow us to focus on the role of coverage coefficients in our final results. Because the density ratio is generally not well-defined with arbitrary offline data, all of these works require some form of all-policy coverage for both estimation and convergence guarantees. The weight gradient calculation in [UIJKSX21] exhibits a recursive decomposition that is related to ours. However, their formulation is not compatible with our data assumptions and they require policy coverage to be well-defined. A follow-up paper in [XYWL21] uses squared-loss regression on the same updates, which is similar in flavor to our gradient estimation objectives. However, they use linear function approximation for weight functions, which is not realizable in general, and also require all-policy coverage for their convergence results. One close work of comparison is [LSAB19], whose PG algorithm uses learned density ratios to reweight the data distribution and approximate the expression of the policy gradient theorem [SMSM99]. 
To handle coverage issues in offline PG, they zero out portions of the trajectory that exceed data coverage, but only do this for (s, a) such that d D(s)πD(a|s) = 0. This is done (in the infinite horizon setting) by resampling the dataset based on an augmented MDP where such (s, a) transition to an absorbing state However, this does not control the (potentially extremely large) variance of the estimator, e.g., on states where d D(s) 0. Their objective can be seen as a special case of J(π) with finite but extremely large choices of Cs and Ca. They show convergence to a stationary point in terms of weight and value estimation errors that are left implicit, and leave the high variance and coverage issues with offline data implicit. Another is [DWS12], that takes the complementary approach and simply calculates the gradient averaged on d D. However, this is a biased gradient object and does not express the policy gradient of any specific function, which means its stationary point may not even exist, thus precluding convergence analysis. The PSPI algorithm in [XCJMA21] is a policy optimization algorithm based on pessimistic value functions. Their setting is somewhat orthogonal to ours in the sense that they study values and we study occupancies, and we note that they do not perform policy optimization with respect to a standalone policy class but rather an implicit one induced by the value functions, which an be extremely large. In the value function sphere, [NZJZW22] leverage the linear structure of linear MDPS to develop closed-form gradient estimators through the value functions. They largely only analyze estimation errors and additionally require a form of all-policy coverage for their results. Lastly, our optimality analysis builds off the results in [BR24] and [AKLM21] that analyze global optimality in the (infinite horizon) online setting. B Additional results and proofs for Sec. 3 B.1 Proofs for Sec. 3.1 Proof of Lem. 3.1 First we expand dπ h using the Bellman flow equation, dπ h(s ) = P s,a P(s |s, a)π(a|s)dπ h 1(s): dπ h(s ) = P s,a P(s |s, a)( π(a|s)dπ h 1(s) + π(a|s) dπ h 1(s)) s,a P(s |s, a)π(a|s)dπ h 1(s)( log π(a|s) + log dπ h 1(s)). In the last line above we use the grad-log trick. Note that log dπ(s) is not well-defined when dπ(s) = 0, but the two terms will cancel out in the above expression for this case. From Bayes theorem, Pπ(sh 1 = s, ah 1 = a|sh = s ) = P(s |s, a)π(a|s)dπ h 1(s)/dπ h(s ), thus log dπ h(s ) = dπ h(s )/dπ h(s ) = Eπ[ log π(a|s) + log dπ h 1(s)|sh = s ] = [Eπ h 1( log π + log dπ h 1)](s ). We use the convention that 0/0 = 0, thus log dπ h is always well-defined. Lastly, the second statement results from Lem. B.1, which shows that log dπ h(s ) is always bounded and well-defined under Asm. 2.1. Lemma B.1. Under Asm. 2.1, we have maxs,h log dπ h(s) h G. Proof. The lemma statement can be derived inductively starting from the observation that Eq. (1) is an expectation over its target functions. As a result, the maximum gradient magnitude should accrue additively over horizons. More concretely, fix h and s. Then log edπ h(s ) = Eπ[ log π(a|s) + log dπ h 1(s)|sh = s ] log π(a|s) + log dπ h 1(s) G + log dπ h 1(s) , using Asm. 2.1 in the last line. Since log dπ 0 = 0 by definition, unrolling the above recursion through timesteps gives the stated result. Lemma B.2. For g : S Rp and f : S A Rp, define the squared loss Lh(g; f, π) = Eπ[ g(sh+1) f(sh, ah) 2]. 
Then for any such f, Eπ h(f) = argmin g:S Rp Lh(g; f, π) and log dπ h+1 = argmin g:S Rp Lh (g; log π + log dπ h, π) . Proof of Lem. B.2. Since the objective is convex, can solve for the minimizer in closed form by taking the derivative and setting it to 0 in an element-wise manner. Fix s . Taking the gradient of Lh(g; f, π) with respect to g(s ), we have that 0 = dπ h(s )g(s ) X s,a P(s |s, a)π(a|s)dπ h 1(s)f(s, a). Rearranging and using the definition of Eπ h gives the result. The second statement follows from Lem. 3.1. B.2 Proofs for Sec. 3.2 : Estimation and local convergence Proof of Thm. 3.1 First we split up the errors contributed by regression and the estimation. Fix π, then EDreg[b J(π)] = P h Es dπ h[bgπ h(s)Rh(s)] and J(π) b J(π) J(π) EDreg[b J(π)] + EDreg[b J(π)] b J(π) The first term is related to the regression error in bgπ h approximating log dπ h, J(π) EDreg[b J(π)] = P h Edπ h[ log dπ h(s)Rh(s)] Edπ h[bgπ h(s)Rh(s)] h Edπ h[ log dπ h(s) bgπ h(s)] = P h q Pp p=1 p log dπ h [bgπ h]p 2 1,dπ h. For a fixed h and p, we recursively decompose p log dπ h [bgπ h]p 1,dπ h p log dπ h [Eπ h 1( log π + bgπ h 1)]p 1,dπ h + [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 1,dπ h p log dπ h 1 [bgπ h 1]p 1,dπ h 1 + [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 2,dπ h, using the fact that log dπ h = Eπ h 1( log π + log dπ h 1) in the second line. Then unrolling the recursion, we have p log dπ h [bgπ h]p 1,dπ h X h [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 2,dπ h Applying Lem. F.2 (more exactly, this is an offline version but we invoke it with ρ = 1 and no clipping for the online setting) with δ = δ/2Hp and a union bound over all h and p, we have [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 2 2,dπ h = E[Lp Dreg h 1(bgπ h, bgπ h 1) Lp Dreg h 1(Eπ h 1( log π + bgπ h 1), bgπ h 1)] 2(εreg h 1)2, where εreg h 1 = q cd Gh2G2 log(2Hp/δ) n . Then for any h, p we have p log dπ h [bgπ h]p 1,dπ h 2cd Gh4G2 log(2Hp/δ) J(π) EDreg[b J(π)] p H p log dπ H [bgπ H]p 1,dπ H = 2cpd GH6G2 log(2Hp/δ) For the second term, |EDreg[b p J(π)] b p J(π)| X h |Es dπ h[bgπ h(s)Rh(s)] 1 s Dh bgπ h(s)Rh(s)| H2G log(2p H/δ) where we use Hoeffding s inequality with union bound, for all h [H] and p p in the last line, given that the randomness of bg is independent given Dreg h . Thus EDreg[b J(π)] b J(π) H2G p log(2p H/δ) n Combining the two terms, our final bound is J(π) b J(π) pd GH6G2 log(2Hp/δ) with the regression error dominating. Proof of Cor. 3.1 For any fixed run of Alg. 1, calling Thm. 3.1 for π(t) with δ = δ/T and taking a union bound over T gives with probability at least 1 δ that J(π(t)) b J(π(t)) pd GH6G2 log(2Hp T/δ) Then setting δ = 1/ n, we have E h J(π(t)) b J(π(t)) i pd GH6G2 log(2Hp Tn) where the expectation is over random samples in Dreg, Dest. Finally, plugging this into the PGD stationary convergence bound in Lem. G.1 gives t E h Gη(π(t), J(π(t))) 2i 4βH T + 6pd GH6G2 log(2Hp Tn) Setting the RHS to ε and setting T, n appropriately gives the result. B.3 Proofs for Sec. 3.2: Global convergence We will establish the conditions under which J(π) satisfies a gradient domination property, meaning that for any θ Θ, the suboptimality of πθ is bounded by some function S that includes a measure of its stationarity, i.e., maxπ ΠΘ e J(π ) J(πθ) S( e J(πθ)). This combined with the sample complexity bounds for stationary convergence established in Cor. 3.1 enables our global optimality result in Cor. 3.2. Though we are concerned with optimizing J(π) induced by running π starting from initial distribution d0, it will be useful to consider performing Alg. 
1 using a different exploratory initial distribution µ (S). By exploratory, we mean that we allow µ(s) > 0 for all s S, unlike d0 (S0). In the (stationary) infinite horizon this is a common trick for obtaining well-defined gradient domination bounds [AKLM21], but its finite-horizon (nonstationary) counterpart is nontrivial and to our knowledge has not previously been formalized (it is listed as future work in [BR24]). We state and prove a more general version of Lem. 3.2: Lemma B.3. For any π and π , define Bπ(π ) := P h,s,a dπ h(s)π (a|s)Qπ h(s, a). Suppose π ΠΘ, 1. (Policy completeness) There exists π+ ΠΘ such that π+ argmaxπ Bπ(π ). 2. (Gradient domination) maxπ ΠΘ Bπ(π ) Bπ(π) m maxθ Θ Bπ(π), θ θ Given ν (S), define the coverage coefficient Cπ := P h dπ h /ν for π = argmaxπ J(π). Then for any πθ ΠΘ, J(π ) J(πθ) m max θ ΠΘ Jµ(πθ), θ θ Cπ max θ ΠΘ Jν(πθ), θ θ , where Jν(π) := Es0 ν,π[P h rh] is the expected return of π in M with initial state distribution ν. Proof of Lem. B.3 First we note two facts that hold regardless of M. We have Qπ g (sh, ah) = Qπ h(sh, ah) for any g h, and dπ g (sh) = 0 if g > h. J(π ) J(πθ) = s,a dπ h (s) (π (a|s) πθ(a|s)) Qπθ h (s, a) Then we will write Qπ(s, a) Qπ h(s, a), thus J(π ) J(πθ) = h=0 dπ h (s) (π (a|s) πθ(a|s)) Qπθ(s, a) (π (a|s) πθ(a|s)) Qπθ(s, a) ! π+(a|s) πθ(a|s) Qπθ(s, a) h dπ h (s) P h dπθ µ,h(s) h dπθ µ,h(s) ! π+(a|s) πθ(a|s) Qπθ(s, a) h dπθ µ,h(s) ! π+(a|s) πθ(a|s) Qπθ(s, a) For the RHS, observe that dπθ µ,g(sh) = 0 for g > h. Then X s,a dπθ µ,h(s) π+(a|s) πθ(a|s) Qπθ(s, a) s,a dπθ µ,h(s) π+(a|s) πθ(a|s) Qπθ h (s, a) s,a dπθ µ,h(s) π+(a|s) πθ(a|s) Qπθ µ,h(s, a) = Bπθ(π+) Bπθ(πθ) m max θ Θ Bπθ(πθ), θ θ = m max θ Θ Jµ(πθ), θ θ Combining the two inequalities results in the final guarantee. Proof of Cor. 3.2 Fix {π(t)}t [T ] from Alg. 1. Then for any t [T], from Lem. 3.2 we have J(π ) J(π(t)) m Cπ max θ ΠΘ D J(π(t)), θ θ E Bm Cπ Gη(π(t)) (Lem. G.4) Then summing through T and taking an expectation over the randomness in the algorithm, we have t J(π ) J(π(t)) T + 6pd GH6G2 log(2Hp Tn) . (Cor. 3.1) B.4 Examples of gradient function class G This section contains formal statements of the claims in Rem. 3.1, and their proofs. We begin by defining the low-rank MDP, noting that for notational compactness we have dropped the features h-dependence given our assumption that there is a one-to-one correspondence between states and the timestep at which they are visited. Definition B.1 (Low-rank MDP). We say M is a low-rank MDP with dimension k if h [H], there exists ϕ : S A Rk and µh : S Rk such that (s, a, s ), we have P(s |s, a) = ϕ(x, a), µ(x ) . Further, ϕ Cϕ and P Prop. B.1 shows that, in low-rank MDPs, a linear-over-linear parameterization for the gradient function class satisfies the completeness requirement in Asm. 3.1, with pseudo-dimension linear in the low-rank dimension and the parameter dimension, i.e., d Gh = O(kp). Proposition B.1. Suppose M is a low-rank MDP (Def. B.1), and suppose µ is known. For each layer h, define the function class Gh = gh = µ Ψ µ ψ : Ψ Rk p, ψ Rk, gh h G, h [H] . Then {Gh} satisfies Asm. 3.1 and has pseudodimension (Def. F.1) d Gh = O(kp). Proof of Prop. B.1. It suffices to show that, for any function f : S A Rp and policy π, its gradient update from Lem. 3.1 is Eπ h( log π + f) Gh+1. Since [Eπ h( log π + f)](s ) = Eπ[ log π(s, a) + f(s)|s ], from Bayes rule and the definition of the Bellman flow operator (see proof of Lem. 
Algorithm 3 Online Occupancy-based PG for General Functionals
Input: functional $F = \{F_h\}$; samples $n$; iterations $T$; policy class $\Pi_\Theta$; gradient function class $\mathcal G$; learning rate $\eta$; density function class $\mathcal F$.
1: for $t = 0, \ldots, T-1$ do
2:   For all $h\in[H]$, deploy $\pi^{(t)}$ for $3n$ trajectories. Set $D^{\mathrm{reg}}_h = \{(s_h, a_h, s_{h+1})\}_{i=1}^n$, and similarly for $D^{\mathrm{est}}_h$ and $D^{\mathrm{mle}}_h$.
3:   for $h = 1, \ldots, H-1$ do
4:     Define $L^{(t)}_{h-1}(g_h, g_{h-1}) := \frac 1 n \sum_{(s,a,s')\in D^{\mathrm{reg}}_{h-1}} \big\| g_h(s') - \nabla\log\pi^{(t)}(a|s) - g_{h-1}(s) \big\|^2$, and set $\widehat g^{(t)}_h = \operatorname{argmin}_{g_h\in\mathcal G_h} L^{(t)}_{h-1}(g_h, \widehat g^{(t)}_{h-1})$.
5:   end for
6:   Estimate $\widehat d^\pi_h \leftarrow \mathrm{MLE}(D^{\mathrm{mle}}, \mathcal F)$.  // Alg. 4
7:   Estimate $\widehat\nabla J_F(\pi) = \sum_h \frac 1 n \sum_{s\in D^{\mathrm{est}}_h} \widehat g^\pi_h(s)\, \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h}$.
8:   Update $\theta^{(t+1)} = \mathrm{Proj}_\Theta\big( \theta^{(t)} + \eta\, \widehat\nabla J_F(\pi^{(t)}) \big)$.
9: end for

B.5 Policy optimization of general functionals

Alg. 3 displays the full algorithm for the optimization of general functionals (described in Sec. 3.3). It shares its occupancy-gradient estimation module with OCCUPG. Compared to Alg. 1, the only change is the objective-gradient calculation in Line 7, which uses a plug-in estimate of the occupancy (Line 6) to evaluate the partial derivative. Since the algorithmic change is small, the analysis of Alg. 3 requires only a few adaptations from the analysis of OCCUPG. For smooth and differentiable functionals, we provide the gradient estimation guarantee below. The smoothness ensures that using plug-in occupancy estimates to evaluate the partial derivative leads to consistent gradient estimates, and is in line with the spirit of the standard objective smoothness requirement (Asm. 3.2).

Assumption B.1. Suppose that for all $h$, $F_h$ has a smooth gradient, i.e., for any $f, f'\in\Delta(\mathcal S)$,
\[
\Big\| \frac{\partial F_h(d)}{\partial d}\Big|_{d=f} - \frac{\partial F_h(d)}{\partial d}\Big|_{d=f'} \Big\|_\infty \le L_F\, \| f - f' \|_1,
\]
and has bounded range $|F_h(d)| \le C_F$.

Theorem B.1. Suppose that Asm. 2.1 and Asm. B.1 hold, and fix $\pi\in\Pi_\Theta$. With probability at least $1-\delta$,
\[
\big\| \nabla J_F(\pi) - \widehat\nabla J_F(\pi) \big\| \lesssim H^2 G L_F \sqrt{\frac{p \log(2pH|\mathcal F|/\delta)}{n}} + \sqrt{\frac{p\, d_{\mathcal G}\, H^6 G^2 \log(2Hp/\delta)}{n}}.
\]
When Asm. 3.2 holds, this result directly leads to a stationary convergence guarantee similar to Cor. 3.1, by union bounding Thm. B.1 over all $T$ iterations and then plugging it into Lem. G.5 (see the proof of Cor. 3.1). We expect that the global convergence result in Cor. 3.2 can also be extended with little overhead when $\{F_h\}$ are convex, but we leave a full investigation to future work.
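The following is a minimal sketch of the one-line change in Line 7 of Alg. 3, shown for an entropy-maximization functional $F_h(d) = -\sum_s d(s)\log d(s)$, whose partial derivative is $\partial F_h(d)/\partial d(s) = -\log d(s) - 1$. The entropy choice, the variable names, and the data layout are our illustrative assumptions.

```python
# Sketch: plug-in functional gradient (Alg. 3, Line 7) -- the reward R(s) is
# replaced by dF_h/dd(s) evaluated at an MLE occupancy estimate d_hat.
import numpy as np

def entropy_partial(d_hat_h, s):
    """dF/dd(s) for F(d) = -sum_s d(s) log d(s)."""
    return -np.log(d_hat_h[s]) - 1.0

def functional_gradient(g_hat, d_hat, D_est):
    """g_hat[h]: array (S, p) of estimated occupancy gradients;
    d_hat[h]: MLE occupancy estimate over S; D_est[h]: sampled states."""
    p = g_hat[0].shape[1]
    grad = np.zeros(p)
    for h in range(len(D_est)):
        for s in D_est[h]:
            grad += g_hat[h][s] * entropy_partial(d_hat[h], s) / len(D_est[h])
    return grad
```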
Proof of Thm. B.1. The analysis follows largely the same lines as the proof of Thm. 3.1; however, we must additionally account for the error of approximating $\frac{\partial F_h(d)}{\partial d(s)}\big|_{d=\widehat d^\pi_h}$ with the plug-in occupancy estimate. This was unnecessary for the expected return in Sec. 3.2, since $\frac{\partial F_h(d)}{\partial d(s)} = R_h(s)$ is independent of the occupancy. First, for all $h\in[H]$, with probability at least $1-\delta$ the occupancy estimates from Alg. 4 satisfy
\[
\| d^\pi_h - \widehat d^\pi_h \|_1 \le \sqrt{\frac{2\log(H|\mathcal F|/\delta)}{n}} =: \varepsilon_{\mathrm{mle}}, \qquad \forall h\in[H].
\]
This follows directly from Lem. D.1 with a union bound over $H$. Next, we isolate the occupancy-estimation error from the error we would like to bound. Define
\[
\bar\nabla J_F(\pi) := \sum_h \mathbb E_{d^\pi_h}\Big[ \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h}\, \nabla\log d^\pi_h(s) \Big],
\]
and decompose
\[
\big\| \nabla J_F(\pi) - \widehat\nabla J_F(\pi) \big\| \le \big\| \nabla J_F(\pi) - \bar\nabla J_F(\pi) \big\| + \big\| \bar\nabla J_F(\pi) - \widehat\nabla J_F(\pi) \big\|.
\]
For the first term,
\begin{align*}
\big\| \nabla J_F(\pi) - \bar\nabla J_F(\pi) \big\|
&\le \sum_h \Big\| \mathbb E_{d^\pi_h}\Big[ \Big( \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=d^\pi_h} - \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h} \Big)\, \nabla\log d^\pi_h(s) \Big] \Big\| \\
&\le \sum_h \mathbb E_{d^\pi_h}\Big[ \Big| \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=d^\pi_h} - \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h} \Big|\; \big\| \nabla\log d^\pi_h(s) \big\| \Big] \\
&\le H^2 G L_F \max_h \| d^\pi_h - \widehat d^\pi_h \|_1 \le H^2 G L_F\, \varepsilon_{\mathrm{mle}},
\end{align*}
using Asm. B.1 in the second-to-last inequality. This takes care of the aforementioned occupancy-estimation error. Conditioned on such $\{\widehat d^\pi_h\}$, the second term $\| \bar\nabla J_F(\pi) - \widehat\nabla J_F(\pi) \|$ is analogous to the error bounded in Thm. 3.1, and the proof follows identically from there, but with dependence on the range $C_F$ of the functionals.

C Additional results and proofs for Sec. 4

C.1 Proofs for Sec. 4.1

Proof of Lem. 4.1. Passing the gradient through the clipped Bellman flow equation in Def. 4.2, we have
\begin{align*}
\nabla \bar d^\pi_h(s') &= \sum_{s,a} P(s'|s,a)\Big[ \nabla\bar\pi_{h-1}(a|s)\,\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big) + \bar\pi_{h-1}(a|s)\,\nabla\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big) \Big] \\
&= \sum_{s,a} P(s'|s,a)\,\bar\pi_{h-1}(a|s)\,\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big)\Big[ \nabla\log\bar\pi_{h-1}(a|s) + \nabla\log\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big) \Big].
\end{align*}
Next, dropping the $h-1$ subscript for a moment, observe that
\[
\nabla\log\big( \bar d^\pi(s)\wedge C^s d^D(s) \big) = \begin{cases} \nabla\log \bar d^\pi(s), & \text{if } \bar d^\pi(s) < C^s d^D(s), \\ 0, & \text{if } \bar d^\pi(s) > C^s d^D(s), \end{cases}
\]
with a discontinuity at $\bar d^\pi(s) = C^s d^D(s)$. For simplicity, we set $\nabla\log(\bar d^\pi(s)\wedge C^s d^D(s)) = \nabla\log\bar d^\pi(s)\,\mathbb 1[\bar d^\pi(s)\le C^s d^D(s)]$. Similarly, $\nabla\log\bar\pi(a|s) = \nabla\log\pi(a|s)\,\mathbb 1[\pi(a|s)\le C^a\pi^D(a|s)]$. Finally, $\nabla\log\bar d^\pi_h(s') = \nabla\bar d^\pi_h(s')/\bar d^\pi_h(s')$, where $\bar d^\pi_h(s') = \sum_{s,a}P(s'|s,a)\,\pi^D_{h-1}(a|s)\,d^D_{h-1}(s)\,\bar\rho^\pi_{h-1}(s,a)$. The lemma statement follows from the definition of $E^{D,\bar\rho}_{h-1}$, and the gradient magnitude bound follows from invoking Lem. C.2 with $\sigma(x,c) = x\wedge c$.

Proof of Prop. 4.1. This result follows from applying Lem. C.7, a more general version of the proposition statement that holds for any (smooth-)clipping function, to $\sigma(x,c) = x\wedge c$.

Proof of Prop. 4.2. The MDP we describe corresponds to a multi-armed bandit with two actions. Consider an MDP with $H = 2$, $\mathcal S_0 = \{s_0\}$, $\mathcal S_1 = \{s_L, s_R\}$, and $\mathcal S_2 = \{s_-, s_+\}$, which are terminal. In every state there are two actions, $\mathcal A = \{L, R\}$, with deterministic transitions. At the first level, $s_0 \xrightarrow{L} s_L$ and $s_0 \xrightarrow{R} s_R$. At the second level, $s_L \to s_-$ and $s_R \to s_+$, regardless of the action taken. For the reward function, $R(s_+) = 1$ and $R = 0$ otherwise. The policy is parameterized by a single parameter $\theta$ such that $\pi(L) = 1-\theta$ and $\pi(R) = \theta$, so that $d^{\pi_\theta}_1(s_R) = d^{\pi_\theta}_2(s_+) = \theta$. Further, both the offline data and the behavior policy are uniform at each level; consequently, $\pi^D_0(L) = \pi^D_0(R) = \tfrac 1 2$ and $d^D_1 = \mathrm{unif}(\mathcal S_1)$. We set $C^s_1 = C^s_2 = 2$ and $C^a_2 = 2$, so that $\bar\pi_h = \pi_h$ for all $h$.

Fix $\theta$ and estimated occupancies $\widehat d^{\pi_\theta}$ and $\widehat d^D$. For any $s'\in\mathcal S_2$ we have
\[
\nabla\log\bar d^{\pi_\theta}_2(s') - \widehat\nabla\log\bar d^{\pi_\theta}_2(s') = \nabla d^{\pi_\theta}_1(s')\,\Big( \mathbb 1\big[\widehat d^{\pi_\theta}_1(s') \le \widehat d^D_1(s')\big] - \mathbb 1\big[d^{\pi_\theta}_1(s') \le d^D_1(s')\big] \Big).
\]
Next, we instantiate $\widehat d^{\pi_\theta}$ and $\widehat d^D$ for any $\pi_\theta$. The preconditions of the proposition are satisfied by an estimated occupancy with $\widehat d^{\pi_\theta}_1(s_L) = \theta + \epsilon/2$ and $\widehat d^{\pi_\theta}_1(s_R) = \theta - \epsilon/2$, together with an estimate $\widehat d^D$ with $\widehat d^D_1(s_L) = 1/2 - \epsilon/2$ and $\widehat d^D_1(s_R) = 1/2 + \epsilon/2$. We will consider $\theta = 1/2$, so that $d^{\pi_\theta}_1 \le C^s_1 d^D_1$; however, $\widehat d^{\pi_\theta}_1(s_L) > \widehat d^D_1(s_L)$. As a result,
\[
\big\| \nabla\log\bar d^{\pi_\theta}_2(s') - \widehat\nabla\log\bar d^{\pi_\theta}_2(s') \big\| = \big| \nabla d^{\pi_\theta}_1(s') \big| = O(1).
\]
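To make the indicator mismatch in the Prop. 4.2 example concrete, the following toy computation (with $\theta = 1/2$ and $\epsilon = 0.1$; variable names are ours) compares the clipping indicators under the true and estimated densities. Whenever they disagree, the estimated gradient of the clipped occupancy is off by $|\nabla_\theta d^{\pi_\theta}_1(s)| = 1$, i.e., an $O(1)$ error.

```python
# Toy numerical check of the indicator mismatch in the Prop. 4.2 example.
theta, eps = 0.5, 0.1

d_true = {"sL": 1 - theta, "sR": theta}            # d^{pi_theta}_1
dD_true = {"sL": 0.5, "sR": 0.5}                   # d^D_1
d_hat = {"sL": theta + eps / 2, "sR": theta - eps / 2}
dD_hat = {"sL": 0.5 - eps / 2, "sR": 0.5 + eps / 2}

for s in ("sL", "sR"):
    true_ind = float(d_true[s] <= dD_true[s])      # indicator with true densities
    est_ind = float(d_hat[s] <= dD_hat[s])         # indicator with estimates
    # |grad_theta d^{pi_theta}_1(s)| = 1, so the gradient error is O(1)
    # whenever the two indicators disagree (here: at sL).
    print(s, true_ind, est_ind, abs(true_ind - est_ind))
```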
C.2 Proofs for Sec. 4.2

First, we formally state and prove the claim that Lem. 4.2 can be reduced to minimizing a squared-loss regression problem recursively over timesteps, i.e.,
\[
\nabla\log\widetilde d^\pi_{h+1} = \operatorname*{argmin}_{g:\mathcal S\to\mathbb R^p} \mathbb E_{D_h}\Big[ \widetilde\rho^\pi_h(s,a)\, \big\| g(s') - \nabla\log\pi\cdot\bar\partial_1\big(\pi, C^a_h\pi^D_h\big) - \nabla\log\widetilde d^\pi_h\cdot\bar\partial_1\big(\widetilde d^\pi_h, C^s_h d^D_h\big) \big\|^2 \Big]. \tag{11}
\]
This is a reweighted offline analog of Eq. (2) from the online setting, and a more general version is presented below with proof.

Lemma C.1. For $g:\mathcal S\to\mathbb R^p$, $f:\mathcal S\times\mathcal A\to\mathbb R^p$, and a reweighting function $\rho:\mathcal S\times\mathcal A\to\mathbb R_+$, define the offline reweighted squared-loss regression objective
\[
\widetilde L_h(g; f, \rho) = \mathbb E_{D_h}\big[ \rho(s_h, a_h)\, \| g(s_{h+1}) - f(s_h, a_h) \|^2 \big].
\]
Then for any such $f$, $E^{D,\rho}_h(f) = \operatorname{argmin}_{g:\mathcal S\to\mathbb R^p} \widetilde L_h(g; f, \rho)$. Further, for the smooth-clipped density ratio
\[
\widetilde\rho^\pi_h(s,a) = \frac{\sigma\big( \widetilde d^\pi_h(s), C^s_h d^D_h(s) \big)}{d^D_h(s)} \cdot \frac{\widetilde\pi_h(a|s)}{\pi^D_h(a|s)}
\]
and the smooth-clipped target function
\[
y^\pi_h := \nabla\log\pi\cdot\bar\partial_1\big( \pi, C^a_h\pi^D_h \big) + \nabla\log\widetilde d^\pi_h\cdot\bar\partial_1\big( \widetilde d^\pi_h, C^s_h d^D_h \big)
\]
from Lem. 4.2, we have $\nabla\log\widetilde d^\pi_{h+1} = \operatorname{argmin}_{g:\mathcal S\to\mathbb R^p} \widetilde L_h(g; y^\pi_h, \widetilde\rho^\pi_h)$.

Proof of Lem. C.1. Since the objective is convex, we can solve for the minimizer in closed form by taking the derivative and setting it to zero element-wise. For each $s'$,
\[
0 = \Big( \sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a) \Big)\, g(s') - \sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)\, f(s,a).
\]
Rearranging and using the definition of $E^{D,\rho}_h$ (Eq. (5)) gives the result. The second statement follows from Lem. 4.2.

Proof of Prop. 4.3. Part 1 follows from the gradient formula
\[
\partial_x \sigma(x,c) = x^{-\beta-1}\big( x^{-\beta} + c^{-\beta} \big)^{-1/\beta - 1} = \big( 1 + x^\beta c^{-\beta} \big)^{-1/\beta - 1}.
\]
It can be seen that $\partial_x\sigma(x,c)\in[0,1]$ and is non-increasing in $x$, thus $\sigma$ is monotonic. Additionally, $|\sigma(x,c) - \sigma(x',c)| \le |x - x'|$; since $\sigma$ is symmetric in its arguments, we also have $|\sigma(x,c) - \sigma(x,c')| \le |c - c'|$.

Next, we prove Part 2. Let $z = x\wedge c$, and observe that $z - \sigma(x,c) \le z - \sigma(z,z)$ since $\sigma$ is monotonic. Further,
\[
z - \sigma(z,z) = z - \big( 2 z^{-\beta} \big)^{-1/\beta} = z\big( 1 - 2^{-1/\beta} \big) \le z\big( 1 - e^{-1/\beta} \big) \le \frac z \beta.
\]
Rearranging and plugging in the expression for $z$ gives the result.

Part 3 can be derived algebraically (but not easily), and is best intuited from the plot of the maximum slope $\sup_{x,x',c\in[0,1]} |\bar\partial_1(x,c) - \bar\partial_1(x',c)|/|x-x'|$ in Figure 2, which corresponds to $L_\sigma/c$ in the RHS of the bound. The left panel shows that the maximum slope increases linearly in $\beta$, and the right panel shows that it increases inversely with $c$. The dashed red line is a conjectured constant, $L_\sigma/c = 0.3\beta/c$, that upper bounds the maximum slope. In particular, $L_\sigma = O(\beta)$.

Figure 2: The $y$-axis plots the maximum slope $\sup_{x,x'} |\bar\partial_1(x,c) - \bar\partial_1(x',c)|/|x-x'| = L_\sigma/c$. Left panel: varying $\beta$ with fixed $c = 0.5$; right panel: varying $c$ with fixed $\beta = 4$.

Proof of Lem. 4.2. Using the chain rule,
\begin{align*}
\nabla\widetilde d^\pi_h(s') &= \sum_{s,a} P(s'|s,a)\Big[ \nabla\widetilde\pi_{h-1}(a|s)\,\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) + \widetilde\pi_{h-1}(a|s)\,\nabla\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \Big] \\
&= \sum_{s,a} P(s'|s,a)\,\widetilde\pi_{h-1}(a|s)\,\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)\Big[ \nabla\log\widetilde\pi_{h-1}(a|s) + \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \Big] \\
&= \sum_{s,a} P(s'|s,a)\,\pi^D_{h-1}(a|s)\, d^D_{h-1}(s)\,\widetilde\rho^\pi_{h-1}(s,a)\Big[ \nabla\log\widetilde\pi_{h-1}(a|s) + \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \Big],
\end{align*}
where in the last line we use the definition of $\widetilde\rho^\pi_{h-1}$ from Lem. 4.2 to make a change of measure. Further,
\begin{align*}
\nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)
&= \frac{\nabla\widetilde d^\pi_{h-1}(s)\,\partial_x\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)}{\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)} \\
&= \nabla\log\widetilde d^\pi_{h-1}(s)\cdot\frac{\widetilde d^\pi_{h-1}(s)\,\partial_x\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)}{\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)} = \nabla\log\widetilde d^\pi_{h-1}(s)\cdot\bar\partial_1\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big),
\end{align*}
by definition. The analogous statement holds for $\nabla\log\widetilde\pi_{h-1}$. Next, using the same change of measure in Def. 4.3, we have
\[
\widetilde d^\pi_h(s') = \sum_{s,a} P(s'|s,a)\,\pi^D_{h-1}(a|s)\, d^D_{h-1}(s)\,\widetilde\rho^\pi_{h-1}(s,a).
\]
The lemma statement follows from $\nabla\log\widetilde d^\pi_h(s') = \nabla\widetilde d^\pi_h(s')/\widetilde d^\pi_h(s')$ and the definition of $E^{D,\widetilde\rho}_{h-1}$ in Eq. (5). The gradient magnitude bound is proved in Lem. C.2.
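For intuition, here is a minimal numerical sketch of the smooth-clipping function $\sigma_\beta(x,c) = (x^{-\beta} + c^{-\beta})^{-1/\beta}$ analyzed in Prop. 4.3, together with its partial derivative in the first argument; the helper names are ours.

```python
# Smooth clipping: a soft version of min(x, c) that is differentiable everywhere.
def sigma(x, c, beta=4.0):
    """sigma_beta(x, c) = (x^{-beta} + c^{-beta})^{-1/beta}; -> min(x, c) as beta grows."""
    return (x ** (-beta) + c ** (-beta)) ** (-1.0 / beta)

def dsigma_dx(x, c, beta=4.0):
    """Partial derivative in the first argument, which lies in [0, 1]."""
    return (1.0 + (x / c) ** beta) ** (-1.0 / beta - 1.0)

# Example: the smooth clip lower-bounds the hard clip and differs from it by
# at most min(x, c)/beta (Prop. 4.3, Part 2).
x, c, beta = 0.8, 0.5, 4.0
print(min(x, c), sigma(x, c, beta), min(x, c) - sigma(x, c, beta) <= min(x, c) / beta)
```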
Lemma C.2 (Bounded gradient magnitude). Suppose $\sigma$ is differentiable almost everywhere. Then, under Part 1 of Asm. 4.1 and Asm. 2.1, $\|\nabla\log\widetilde d^\pi_h(s)\|_\infty \le hG$.

Proof of Lem. C.2. As a consequence of Asm. 4.1, which states that the gradient of $\sigma$ is non-increasing in its first argument, for any $x, c \ge 0$ we have
\[
\sigma(x,c) = \int_0^x \partial_x\sigma(z,c)\, dz \ge \int_0^x \partial_x\sigma(x,c)\, dz = x\,\partial_x\sigma(x,c).
\]
Substituting $x \leftarrow \widetilde d^\pi_{h-1}(s)$ and $c \leftarrow C^s_{h-1} d^D_{h-1}(s)$, the above shows that $\widetilde d^\pi_{h-1}(s)\,\partial_x\sigma(\cdot)/\sigma(\cdot) \le 1$, and hence
\[
\big\| \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \big\| \le \big\| \nabla\log\widetilde d^\pi_{h-1}(s) \big\|
\]
pointwise. Since Lem. 4.2 expresses $\nabla\log\widetilde d^\pi_h$ through a valid (conditional) expectation, for any $s'\in\mathcal S$ we have
\begin{align*}
\big\| \nabla\log\widetilde d^\pi_h(s') \big\|_\infty &\le \max_{s,a}\Big\{ \big\| \nabla\log\sigma\big( \pi(a|s), C^a_{h-1}\pi^D_{h-1}(a|s) \big) \big\|_\infty + \big\| \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \big\|_\infty \Big\} \\
&\le G + \max_s \big\| \nabla\log\widetilde d^\pi_{h-1}(s) \big\|_\infty \le hG,
\end{align*}
where we use Asm. 2.1 in the second line and unroll the same inequalities through the levels in the last line.

Proof of Prop. 4.4. First, we bound the difference between the smooth-clipped and (hard-)clipped density functions:
\[
\big\| \widetilde d^\pi_h - \bar d^\pi_h \big\|_1 \le \big\| \sigma\big( \widetilde d^\pi_{h-1}, C^s_{h-1}d^D_{h-1} \big) - \bar d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 + \max_s \big\| \widetilde\pi_{h-1}(\cdot|s) - \bar\pi_{h-1}(\cdot|s) \big\|_1.
\]
For the second term and any $s$,
\[
\big\| \widetilde\pi_{h-1}(\cdot|s) - \bar\pi_{h-1}(\cdot|s) \big\|_1 = \big\| \sigma\big( \pi_{h-1}(\cdot|s), C^a_{h-1}\pi^D_{h-1}(\cdot|s) \big) - \pi_{h-1}(\cdot|s)\wedge C^a_{h-1}\pi^D_{h-1}(\cdot|s) \big\|_1 \le D_\sigma \big\| \pi_{h-1}(\cdot|s)\wedge C^a_{h-1}\pi^D_{h-1}(\cdot|s) \big\|_1 \le D_\sigma.
\]
For the first term,
\begin{align*}
\big\| \sigma\big( \widetilde d^\pi_{h-1}, C^s_{h-1}d^D_{h-1} \big) - \bar d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1
&\le \big\| \sigma\big( \widetilde d^\pi_{h-1}, C^s_{h-1}d^D_{h-1} \big) - \widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 + \big\| \widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} - \bar d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 \\
&\le D_\sigma \big\| \widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 + \big\| \widetilde d^\pi_{h-1} - \bar d^\pi_{h-1} \big\|_1,
\end{align*}
where in the last line we use Asm. 4.1 to upper bound the first term, and the 1-Lipschitzness of the pointwise minimum to upper bound the second term. Since $\|\widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1}\|_1 \le \|\widetilde d^\pi_{h-1}\|_1 \le \| d^\pi_{h-1}\|_1 = 1$, we have
\[
\big\| \widetilde d^\pi_h - \bar d^\pi_h \big\|_1 \le 2D_\sigma + \big\| \widetilde d^\pi_{h-1} - \bar d^\pi_{h-1} \big\|_1 \le 2hD_\sigma
\]
after rolling out the induction. Then for any $\pi$,
\[
\big| \bar J(\pi) - \widetilde J(\pi) \big| \le \sum_h \big\| \widetilde d^\pi_h - \bar d^\pi_h \big\|_1 \le 2H^2 D_\sigma.
\]
Lastly, let $\widetilde\pi^* = \operatorname{argmax}_{\pi\in\Pi_\Theta}\widetilde J(\pi)$, and define $\bar\pi^*$ similarly. Then
\[
\bar J(\bar\pi^*) - \widetilde J(\widetilde\pi^*) \le \bar J(\bar\pi^*) - \widetilde J(\bar\pi^*) \le 2H^2 D_\sigma.
\]

C.3 Proofs for Sec. 4.3

First, we give an example of $\mathcal G$ that satisfies Asm. 4.2 in low-rank MDPs.

Proposition C.1. Suppose $M$ is a low-rank MDP (Def. B.1), and suppose $\mu$ is known. For each layer $h$, define the function class
\[
\mathcal G_h = \Big\{ g_h = \frac{\mu^\top\Psi}{\mu^\top\psi} : \Psi\in\mathbb R^{k\times p},\ \psi\in\mathbb R^k,\ \|g_h\|_\infty \le hG \Big\}, \quad h\in[H].
\]
Then $\{\mathcal G_h\}$ satisfies Asm. 4.2 and has pseudo-dimension (Def. F.1) $d_{\mathcal G_h} = O(kp)$.

Proof of Prop. C.1. It suffices to show that, for any $f:\mathcal S\times\mathcal A\to\mathbb R^p$, reweighting function $\rho:\mathcal S\times\mathcal A\to\mathbb R_+$, and $h\in[H]$, the gradient update in Lem. 4.2 satisfies $E^{D,\rho}_h(f)\in\mathcal G_{h+1}$. Fix $\rho$, $f$, and $h$. From the definition of $E^{D,\rho}_h$, we have
\[
[E^{D,\rho}_h f](s') = \frac{\sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)\, f(s,a)}{\sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)}.
\]
Then, since $P(s'|s,a) = \langle\phi(s,a), \mu(s')\rangle$, we can apply the same steps as in the proof of Prop. B.1 to show that there exist $\Psi\in\mathbb R^{k\times p}$ and $\psi\in\mathbb R^k$ such that $[E^{D,\rho}_h f](s') = \mu(s')^\top\Psi / \mu(s')^\top\psi$ for all $s'\in\mathcal S$. Specifically, $\psi = \sum_{s,a}\phi(s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)$, and the $p'$-th column of $\Psi$ is $\Psi_{p'} = \sum_{s,a}\phi(s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)\, f_{p'}(s,a)$.

Proof of Thm. 4.1. For the remainder of this section, we define the constants $\varepsilon_w$ and $\varepsilon_{\mathrm{mle}}$ to be the estimation errors of $\widetilde w^\pi$ and $d^D$, respectively, such that for a given $\pi$ and any $h\in[H]$ we have
\[
\big\| \widehat w^\pi_h - \widetilde w^\pi_h \big\|_{1, d^{D,\prime}_{h-1}} \le \varepsilon_w, \qquad \big\| \widehat d^D_h - d^D_h \big\|_1 \le \varepsilon_{\mathrm{mle}}, \qquad \big\| \widehat d^{D,\prime}_h - d^{D,\prime}_h \big\|_1 \le \varepsilon_{\mathrm{mle}}.
\]
We can obtain such estimates using Alg. 4 and Alg. 5: a direct application of Lem. D.1 with a union bound gives $\varepsilon_{\mathrm{mle}} = O\big( \sqrt{\log(H|\mathcal F|/\delta)/n} \big)$, and Thm. E.1 gives the corresponding bound on $\varepsilon_w$ in terms of $\log(H|\mathcal W|/\delta)$. Next, recall that
\[
\widehat\nabla\widetilde J(\pi) = \sum_h \frac 1 n \sum_{(s,a,s')\in D^{\mathrm{grad}}_h} \widehat w^\pi_h(s')\, R_h(s')\, \widehat g^\pi_h(s').
\]
Its expected value over draws of $D^{\mathrm{grad}}$ is
\[
\mathbb E_{D^{\mathrm{grad}}}\big[ \widehat\nabla\widetilde J(\pi) \big] = \sum_h \mathbb E_{s'\sim d^{D,\prime}_{h-1}}\big[ \widehat w^\pi_h(s')\, R_h(s')\, \widehat g^\pi_h(s') \big].
\]
First we bound the statistical error from using samples to approximate $\nabla\widetilde J(\pi)$, given the gradient estimate.
Fix the other datasets, then b e J(π) EDgrad h b e J(π) i p max p [p] b Es d D, h 1 [ bwπ h(s )Rh(s )bgπ h(s )] Es d D, h 1 [ bwπ h(s )Rh(s )bgπ h(s )] h Cs h Ca h max p [p],h [H] b Es d D, h 1 [bgπ h(s )] Es d D, h 1 [bgπ h(s )] h Cs h Ca h where εstat = HG q log(8p H/δ) 2n is obtained by using Hoeffding s with δ = δ/4, since the randomness in bw and bg are fixed. Then for any p [p], EDgrad h h b e J(π) i e J(π) s d D, h 1(s )Rh(s ) bwπ h(s ) bgπ h(s ) wπ h(s ) log edπ h(s ) s d D, h 1(s )Rh(s )wπ h(s ) bgπ h(s ) log edπ h(s ) + s d D, h 1(s )Rh(s )byπ h(s ) ( ˆwπ h(s ) wπ h(s )) h G bwπ h wπ h 1,d D, h 1 + max p [p] [bgπ h]p log edπ h 1, e dπ h The first term is bounded by Thm. E.1. For the second term, we use the following decomposition, which is proved at the end of this section. Lemma C.3 (Gradient estimation error decomposition). Let εmle and εw be such that for all h [h] and π ΠΘ, we have bd D h d D h 1, bd D, h d D, h 1 εmle h and bwπ h ewπ h 1,d D, h 1 εw h . Then under Asm. 2.1 and Asm. 4.1, for any p [p], bgπ h from Alg. 2 satisfies bgπ,p h p log edπ h 1, e dπ h 6h Cs h 1Ca h 1LσG εmle h 1 (data distribution estimation error) + 3h LσG εw h 1 (occupancy estimation error) + bgπ,p h [ED,bρ h 1( log eπh 1 + bgh 1)]p 1, e dπ h (statistical regression error) + bgπ,p h 1 p log edπ h 1 1, e dπ h 1 (recursive term) From Lem. C.4, we have [ED,bρ h 1bgh 1]p bgπ,p h 1, e dπ h q 2 1 + Cs h 1εmle h 1 εreg h + 2h G 2Cs h 1εmle h 1 + εw h 1 where εreg h = O( q d GCs h 1Ca h 1h2G2 log(np H/δ) n ). Then plugging the above into the decomposition in Lem. C.3, we have bgp h p log edπ h 1, e dπ h 10Cs h 1Ca h 1h GLσ εmle h 1 + 5h GLσ εw h 1 2 1 + Cs h 1εmle h 1 εreg h + bgp h 1 p log edπ h 1 1, e dπ h 1 Unrolling through timesteps, we have bgp h p log edπ h 1, e dπ h 10h2GLσ X g 0 only if s Sh. Now define π+ such that for any s, eπ+( |s) = argmaxπ (A) D eπ, e Qπθ h (s, ) E . Then e J(πE) e J(πθ) sh,ah σ edπE h (sh), d D h (sh) eπE(ah|sh) eπθ(ah|sh) e Qπθ h (sh, ah) sh,ah σ edπE h (sh), d D h (sh) eπ+(ah|sh) eπθ(ah|sh) e Qπθ h (sh, ah) (20) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h sh,ah σ edπθ h (sh), d D h (sh) π+(ah|sh) πθ(ah|sh) 1 πθ(ah|sh), Ca hπD(ah|sh) e Qπθ h (sh, ah) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h sh,ah σ edπθ h (sh), d D h (sh) eπ+(ah|sh) eπθ(ah|sh) e Qπθ h (sh, ah) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h e Bπ+(πθ) e Bπθ(πθ) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h e Bπθ (πθ) e Bπθ(πθ) Proof of Prop. C.2 Under all-policy coverage, we can apply the result in Lem. 3.2, noting that d 0 = 1 H d D h , and dπ h Cs hd D h . P(X|S, a) = ϵ Figure 3: Example in Prop. C.3 Proposition C.3 (Vanishing gradient from clipping with exploratory data). Consider the MDP in App. C.6, and the data distribution where d D(X) = 1/2 and d D(Y ) = ϵ and d D(Z) = (1 ϵ)/2 for some ϵ [0, 1]. For any C, we have all-policy coverage, i.e., dπ h Chd D h for all h and all policies π. Let π be the stationary (and in this case, optimal) policy of running Alg. 2 with D described in Prop. C.2. Then J(π ) J(π) = (1 ϵ) (1 2CY ϵ) . If ϵ is exponentially small, J(π ) J(π) = O(1) unless CY is exponentially large. Proof. The example boils down to a simple bandit problem of choosing either L or R in state X. π(L|X) = CY d D(Y ) d D(X) = 2CY ϵ, and π(R|X) = 1 π(L|X). In comparison, π (L|X) = 1. Then e J(π) = J(π) = CY d D(Y ) d D(X) + ϵ(1 π(L|X)). 
In comparison, J(π ) = 1, so J(π ) J(π) = (1 ϵ) (1 2CY ϵ) For reasonable choices of CZ (say, 2 or 3), CY must be proportional to ϵ 1 for the suboptimality gap to shrink, and in particular if ϵ is exponentially small then CY must be exponentially large, which blows up the RHS of the bound. Proof of Cor. C.2 The first step follows the proof of Cor. 3.2. Combining Thm. 4.1 with Lem. G.5 and plugging in above, we have t e J(π ) e J(π(t)) h Cs h Ca h)2 L2σ log(N (ε, G) N D (ε, ΠΘ)|F||W|) n Then we also have t J(π ) J(π(t)) t e J(π ) e J(π(t)) + 2H2Dσ (Prop. 4.4) 2H2Dσ + Bm CC h Cs h Ca h)2 L2σ log(N (ε, G) N D (ε, ΠΘ)|F||W|) n Additional results Helper lemmas are stated and proved below. Lemma C.7. Suppose σ satisfies Parts 1 and 2 of Asm. 4.1. Then for e J(π) = P s edπ h(s)Rh(s), s,a σ edπ h(s), Cs hd D h (s) eπh(a|s) e Qπ h(s, a), e Qπ h(s, a) = X s Ph(s |s, a) Rh+1(s ) + X a eπh+1(a |s ) xσ edπ h+1(s ), Cs h+1d D h+1(s ) e Qπ h+1(s , a ) Proof of Lem. C.7. For notational clarity we omit Cs h below. Expanding edπ, we have edπ h(sh) = X sh 1,ah 1 P(sh|sh 1, ah 1) eπh 1(ah 1|sh 1)σ edπ h 1(sh 1), d D h 1(sh 1) + eπh 1(ah 1|sh 1) xσ edπ h 1(sh 1), d D h 1(sh 1) edπ h 1(sh 1) (sh 1,ah 1,...,sg,ag) t=g+1 P(st+1|st, at)eπt(at|st) xσ edπ t (st), d D t (st) # P(sg+1|sg, ag)σ edπ g (sg), d D g (sg) eπg(ag|sg) For short, define e Peπ(sh|sg, ag) := X (sh 1,ah 1,...,sg+1,ag+1) t=g+1 P(st+1|st, at)eπt(at|st) xσ edπ t (st), d D t (st) # P(sg+1|sg, ag) observing if eπh = πh and xσ edπ h, d D h = 1 for all h, we have e Peπ(sh|sg, ag) = Pπ(sh|sg, ag), the standard transition kernel from (sg, ag) sh. This occurs, for example, when σs is hard clipping and π is fully covered by data. Then using the above definition, we have edπ h(sh) = X sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg)e Peπ(sh|sg, ag). (21) Plugging this expression into J(π), we obtain sh edπ h(sh)R(sh) sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg)e Peπ(sh|sg, ag) sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg) e Peπ(sh|sg, ag)R(sh) sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg) e Qπ(sg, ag) where we have defined e Qπ g (sg, ag) := e Peπ(sh|sg, ag)R(sh) sg+1 P(sg+1|sg, ag) R(sg+1) + X ag+1 eπg+1(ag+1|sg+1) xσ edπ g+1(sg+1), d D g+1(sg+1) e Qeπ g+1(sg+1, ag+1) Lemma C.8. If σ is concave in its first argument, for any π and π we have e J(π ) e J(π) s,a σ edπ h (s), d D h (s) (eπ h(a|s) eπh(a|s)) e Qπ h(s, a), where e Qπ h is defined in Lem. C.7. Proof. This statement follows straightforwardly from plugging in Lem. C.9 and rearranging, similar to the proof of Lem. C.7. Lemma C.9. If σ is concave in its their first arguments, then for any h and π, π edπ h (s ) edπ h(s ) s,a σ edπ g (s), d D g (s) σ π g(a|s), πD g (a|s) σ πg(a|s), πD g (a|s) e Pπ(sh = s |sg = s, ag = a), e Pπ(sh|sg, ag) := X sh 1:g+1,ah 1:g+1 t=g+1 P(st+1|st, at)eπt(at|st) xσ edπ t (st), d D t (st) # P(sg+1|sg, ag). Proof of Lem. C.9. Define πg = {π 1, . . . , π g 1, πg, . . . , πH 1}, i.e., a policy that starts playing π at timestep g, and plays π for the timesteps before that. edπ h (s ) edπ h(s ) = edπ h (s ) edπh 1 h (s ) + edπh 1 h (s ) edπ h(s ) For the first pair of terms, π and πh 1 only differ the policy used to take the action ah 1 (and both play π before that), thus dπ h 1 = dπh 1 h 1 and edπ h (s ) edπh 1 h (s ) s,a P(s |s, a) σ π h 1(a|s), πD h 1(a|s) σ πh 1(a|s), πD h 1(a|s) σ edπ h 1(s), d D h 1(s) For the second pair of terms, πh 1 and π both play π at time h 1, but the former uses π for timesteps 1, . . . 
, h 2: edπh 1 h (s ) edπ h(s ) s,a P(s |s, a)σ πh 1(a|s), πD h 1(a|s) σ edπh 1 h 1 (s), d D h 1(s) σ edπ h 1(s), d D h 1(s) s,a P(s |s, a)σ πh 1(a|s), πD h 1(a|s) σ edπ h 1(s), d D h 1(s) σ edπ h 1(s), d D h 1(s) s,a P(s |s, a)σ πh 1(a|s), πD h 1(a|s) xσ edπ h 1(s), d D h 1(s) edπ h 1(s) edπ h 1(s) where the last inequality above uses the concavity of σs in the first argument (recall concave functions f satisfy f(y) f(x) + f (x)(y x)). Combining the above two inequalities, we have the recursive relationship edπ h (s ) edπ h(s ) s,a P(s |s, a) σ π h 1(a|s), πD h 1(a|s) σ πh 1(a|s), πD h 1(a|s) σ edπ h 1(s), d D h 1(s) + σ πh 1(a|s), πD h 1(a|s) xσ edπ h 1(s), d D h 1(s) edπ h 1(s) edπ h 1(s) ! Unrolling through timesteps gives the lemma statement. Lemma C.10. Let Ca = maxh Ca h. Suppose ΠΘ is the direct policy parameterization, i.e., πθ(a|s) = θs,a, and σ is such that Dσ Ca for all h. Then for any γ (0, 1], in Def. C.1 we have N D (γ, ΠΘ) (Ca/γ)SAH. Proof of Lem. C.10. Typical gridding-style arguments discretize the range of π(a|s) for each (s, a). Since we are concerned with creating a cover for the policy ratio, however, a naive argument will incur 1/ mins,a πD(a|s) in the grid s cardinality. Our solution is to grid ΠΘ adaptively according to the magnitude of πD(a|s). Intuitively, we only need to grid up to the threshold For each (h, s, a), define the adaptive gridding scale to be γ hsa = γπD h (a|s). For any π ΠΘ, set its cover π as follows. k , if π(a|s) Ca hπD h (a|s), Ca hπD h (a|s), otherwise. Let ΠΘ = {π : π ΠΘ}. Then |ΠΘ| (maxh Ca h/γ)HSA. Further, π(a|s) Ca hπD h (a|s) πh(a|s) Ca hπD h (a|s) γπD h (a|s), thus (π Ca hπD h ) (πh Ca hπD h ) πD h γ, and applying Lem. C.11 gives the result. Lemma C.11. Suppose ΠΘ satisfies Def. C.1 with σxc = (x c). Then for any π ΠΘ, let π ΠΘ be its cover. Under Asm. 4.1, we have σ(π,CπD) πD σ(π,CπD) Proof of Lem. C.11. If π(a|x) CπD(a|x), using the 1-Lipschitzness of σ we have |σ π(a|s), CπD(a|s) σ π(a|s), CπD(a|s) | = |σ π(a|s) CπD(a|s) , CπD(a|s) σ π(a|s), CπD(a|s) | | π(a|s) CπD(a|s) π(a|s)| Algorithm 4 Maximum Likelihood Estimation Input: datasets {Dh}, function class F 1: for h = 1, . . . , H do 2: Estimate marginal data distributions bd D h 1 and bd D, h 1 by MLE on dataset Dh 1 bd D h 1 = argmax dh 1 Fh 1 (s, , ) Dh 1 log (dh 1(s)) (22) bd D, h 1 = argmax dh Fh ( , ,s ) Dh 1 log (dh(s )) . 3: end for Output: estimated data distributions {bd D h }h [H] and {bd D, h }h [H] If π(a|x) > CπD(a|x), |σ π(a|s), CπD(a|s) σ π(a|s), CπD(a|s) | CπD(a|s) σ π(a|s), CπD(a|s) CπD(a|s) (1 Dσ) π(a|s) CπD(a|s) C(Dσ + γ)πD, using Asm. 4.1 in the second inequality. As a result, σ(π(a|s),CπD(a|s)) πD(a|s) σ(π(a|s),CπD(a|s)) D Maximum Likelihood Estimation Algorithm 4 displays the data distribution estimation procedure used in offline gradient estimation (Algorithm 2), which is a direct application of MLE. The general formulation of the MLE problem utilized in this paper is to estimate a probability distribution over the instance space S. Given an i.i.d. sampled dataset D = {s(i)}n i=1 and a function class F, we optimize the MLE objective of the form bf = argmin f F s D log (f(s)) . (23) We assume F is finite, and refer readers to [LNSJ23; HCJ23] for techniques for handling infinite function classes. The general MLE guarantee is stated below, and is a well-established result (for example, a proof can be found in Appendix E of [AKKS20]). Lemma D.1 (MLE guarantee). Let D = {s(i)}n i=1 be a dataset, where s(i) are drawn i.i.d. 
from some fixed probability distribution f over S. Consider a function class F that satisfies: (i) f F, and (ii) each function f F is a valid probability distribution over S (i.e., f (S)) Then with probability at least 1 δ, bf from Eq. (23) has ℓ1 error guarantee 2 log(|F|/δ) The formal guarantee of Algorithm 4 is stated below, which is a straightforward application of Lemma D.1 with union bound (over all functions in F, and over all timesteps). Assumption D.1 (MLE Realizability). Suppose that h [h], we have d D h , d D, h 1 Fh for D defined in Def. 4.1. Additionally, f (S) is a valid distribution for all f Fh. Algorithm 5 Fitted Occupancy Iteration with Smooth Clipping Input: policy π, datasets {Dh}, function class W, clipping thresholds {Cs h, Ca h}, data estimates {bd D h } and {bd D, h }. 1: Initialize bdπ 0 = bd D 0 . 2: for h = 1, . . . , H do 3: Define LDh 1(wh, wh 1, eπh 1) := 1 |Dh 1| P (s,a,s ) Dh 1 wh(s ) wh 1(s) eπh 1(a|s) πD h 1(a|s) 2 , and estimate bwπ h = argmin wh Wh LDh 1 wh, σ( b dπ h 1,Cs h 1 b d D h 1) b d D h 1 , σ πh 1, Ca h 1πD h 1 , (24) 4: Set the estimate bdπ h = bwπ h bd D, h 1. 5: end for Output: estimated state occupancies { bwπ h}h [H]. Lemma D.2. Suppose {Fh} satisfies Asm. D.1. Then with probability at least 1 δ, for all h [H] the outputs of Algorithm 4 satisfy bd D h d D h 1 εmle and bd D, h d D, h 1 εmle, where 2 log(2H|F|/δ) E Offline Density Estimation The algorithm for offline density estimation is displayed in Alg. 5, and is directly copied from Algorithm 1 of [HCJ23], but with two minor modifications. The first is that the densities are clipped using a function σ, that can take clipping as a special case. The second is that it outputs the learned weights instead of the learned densities. The weight function class completeness assumption is shown Asm. E.1, and is satisfied in low-rank MDPs using linear-over-linear function classes that have pseudo-dimension bounded by MDP rank. It can be seen as a 1-dimensional version of Asm. 4.2 where ρ = 1 and in that sense strictly weaker. Assumption E.1 (Weight function completeness). For any π ΠΘ and h [H], we have σ(w f ,Cs h 1f) f eπh 1 πD h 1 Wh, w Wh 1, f, f Fh 1, Theorem E.1. Suppose σ satisfies Asm. 4.1 and W satisfies Asm. E.1. Let {bd D h }g and {bd D, h } be such that h [H], bd D h d D h 1 εmle and bd D, h d D, h 1 εmle. Then with probability at least 1 δ, the outputs { bwπ h} of Algorithm 5 satisfy for all h [H] bwπ h ewπ h 1,d D, h 1 g