# Occupancy-based Policy Gradient: Estimation, Convergence, and Optimality

Audrey Huang, Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL 61820, audreyh5@illinois.edu
Nan Jiang, Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL 61820, nanjiang@illinois.edu

Abstract

Occupancy functions play an instrumental role in reinforcement learning (RL) for guiding exploration, handling distribution shift, and optimizing general objectives beyond the expected return. Yet, computationally efficient policy optimization methods that use (only) occupancy functions are virtually non-existent. In this paper, we establish the theoretical foundations of model-free policy gradient (PG) methods that compute the gradient through the occupancy for both online and offline RL, without modeling value functions. Our algorithms reduce gradient estimation to squared-loss regression and are computationally oracle-efficient. We characterize the sample complexities of both local and global convergence, accounting for both finite-sample estimation error and the roles of exploration (online) and data coverage (offline). Occupancy-based PG naturally handles arbitrary offline data distributions, and, with one-line algorithmic changes, can be adapted to optimize any differentiable objective functional.

1 Introduction

Value-based methods have been the dominant paradigm in model-free reinforcement learning, with a solid theoretical foundation in large state spaces under function approximation [CJ19; JYWJ20a; ZLKB20; JLM21; XJ21; XFBJK22]. In contrast, a model-free RL paradigm based on their natural counterparts, occupancy functions, remains largely under-investigated. Occupancy functions are densities that describe a policy's state visitation, and play instrumental roles in guiding exploration [HKSVS19; AFK24], handling distribution shift [HM17; NCDL19; CJ22], and optimizing general objectives beyond the expected return [ZBWK20; MDSDBR22]. Despite this, they are seldom modeled directly in learning algorithms and appear only in analyses, except in conjunction with value functions in marginalized importance sampling [LLTZ18; NDKCLS19; UHJ20; ZHHJL22; HJ22a]. Recently, [HCJ23] developed algorithms for online and offline RL that model only occupancies via density function classes, spotlighting their roles in handling non-exploratory offline data and in online exploration. However, their focus was on statistical guarantees, and computationally efficient policy optimization for occupancy-based methods remained an open problem.

In response, we develop model-free policy gradient (PG) algorithms that compute the gradient through occupancy functions, without estimating any values. By leveraging a Bellman-like recursion, we reduce occupancy-based gradient estimation to solving a series of squared-loss minimization problems, which can be done in a computationally oracle-efficient manner. Our analysis captures the effects of gradient estimation error, of exploration (in online PG, characterized by the initial state distribution), and of offline data quality (in offline PG) on the sample and iteration complexity required for local and global convergence. In the online setting, our results complement previous works on the optimality of value-based PG [AKLM21; BR24] and extend past their scope to in-
clude general objectives of occupancy functions, such as entropy maximization for pure exploration and risk-sensitive functionals in safe RL [MDSDBR22]. These objectives generally cannot be optimized using value-based policy gradients because they do not admit value functions or Bellman-like equations with which to estimate them [ZBWK20; HDGP24]. In the offline setting, we handle gradient estimation from fixed datasets of poor coverage, which departs from most existing (value-based) off-policy PG estimators that assume an exploratory dataset [KU20; XYWL21; NZJZW22]. Learning with non-exploratory data is a core consideration in recent offline RL [XCJMA21; ZHHJL22], and gives rise to unique challenges in our setting: occupancies are converted into density ratios for learning purposes, but these ratios become unbounded when the data lacks coverage. [HCJ23] used clipping to handle occupancy estimation under poor coverage, which we show is insufficient for gradient estimation (Prop. 4.2). Instead, a novel smooth-clipping mechanism (Sec. 4.2) is developed to provide statistically robust gradient estimates. App. A includes a full discussion of related work, and our contributions are organized as follows: 1. Online PG (Sec. 3) We propose OCCUPG, an occupancy-based PG algorithm that reduces gradient estimation to squared-loss minimization, based on a recursive Bellman flow-like update for the occupancy gradient. We analyze the sample complexities for both local and global convergence, and, notably, our algorithm and analyses extend straightforwardly to the optimization of general objective functionals. 2. Offline PG (Sec. 4) For offline RL we develop and analyze OFF-OCCUPG, which optimizes only the portions of a policy s return that are adequately covered by offline data. Conceptually, our algorithm is based on combining the methods in Sec. 3 with (a smoothed version of) the recursively clipped occupancies from [HCJ23]. As a result, our estimation and convergence guarantees do not require assumptions on data coverage, which relaxes the restrictions of previous works. 2 Preliminaries Finite-horizon Markov decision process (MDP). Finite-horizon MDPs are defined by the tuple M = (S, A, P, R, H, d0), where S is the state space, A is the action space, and H is the horizon. We use [H] = {0, . . . , H} and when clear from the context, use { h} = { h}h [H]. For notational compactness we assume that S = h Sh is the union of H disjoint sets {Sh}, each of which is the set of states reachable at timestep h. This is WLOG as we can always augment the state space with [H] at the cost of only H factors [JKALS17; MBFR24]. Since each state can only be visited at a single timestep, we can now define the (non-stationary) transitions as P : S A (S), and the initial state distribution as d0 (S0). We assume the reward function R : S [0, 1] is bounded on the unit interval and (for simplicity) state-wise deterministic. This sufficiently captures the challenges of our setting since the occupancies are densities over states, and it will be easily seen later that our results generalize to per-state-action rewards. A policy π : S (A) interacting with M observes trajectories {(sh, ah, sh+1, rh+1)}H 1 h=0 , and has expected return J(π) = Eπ[PH h=1 R(sh)]. At any (h, s, a), its expected return-to-go is encoded in the value function Qπ h(s, a) = Eπ[P h >h R(sh )|sh = s, ah = a]. For each h [H], a policy s occupancy function dπ h (S) is a p.d.f. describing its state visitation, dπ h(s) = Pπ(sh = s). 
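To ground this notation, the following is a minimal Monte Carlo sketch of how the occupancies and the expected return could be estimated from on-policy rollouts in a small tabular MDP. The names `sample_trajectory`, `R`, `n_states`, and `H` are hypothetical stand-ins for illustration rather than objects defined in the paper.

```python
import numpy as np

def estimate_occupancy_and_return(sample_trajectory, R, H, n_states, n=10_000):
    """Monte Carlo estimates of the occupancies d^pi_h and the return J(pi).

    `sample_trajectory()` is a hypothetical simulator returning the list of
    states [s_1, ..., s_H] visited in one rollout of pi; `R[s]` is the
    per-state reward. Since d^pi_h(s) = P^pi(s_h = s), the empirical
    visitation frequencies are unbiased estimates of d^pi_h.
    """
    d_hat = np.zeros((H + 1, n_states))       # d_hat[h, s] ~ d^pi_h(s), h = 1..H
    J_hat = 0.0
    for _ in range(n):
        states = sample_trajectory()          # length-H list of visited state ids
        for h, s in enumerate(states, start=1):
            d_hat[h, s] += 1.0 / n
            J_hat += R[s] / n                 # J(pi) = E^pi[sum_h R(s_h)]
    return d_hat, J_hat
```

Since J(π) = Σ_h ⟨d^π_h, R⟩, the same value can also be read off the estimated occupancies, e.g., `(d_hat @ R).sum()` recovers `J_hat`.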
In combination with the policy, the MDP dynamics dictate the evolution of the occupancy over timesteps. This is encoded in the recursive Bellman flow equation, which mandates that dπ h = Pπdπ h 1 for all h [H]. Here, Pπ is the Bellman flow operator with (Pπf)(s ) := P s,a P(s |s, a)π(a|s)f(s) R , for any function f : S R . Policy optimization. For an objective function f : ΠΘ R, the general goal of this work is to find argmaxπθ ΠΘ f(πθ) over a policy class ΠΘ = {πθ : θ Θ, θ Rp, θ B}, parameterized by a convex and closed parameter class Θ with dimension p. One example of f is the expected return J(πθ). Projected gradient ascent (PGA) will be our base algorithm for policy optimization. For a fixed learning rate η and iterations t [T], it iteratively updates θ(t+1) = ProjΘ θ(t) + η f(πθ(t)) . Here, f(πθ) = [ f(πθ) θp ]p [p] Rp is the gradient with respect to θ, where superscript p indexes the p-th entry of a vector. We will assume that the gradient of the policy s log-probability is bounded, as is ubiquitous in the PG literature [LSAB19; AKLM21]. Assumption 2.1. For all πθ ΠΘ, maxs,a log πθ(a|s) G. We will later analyze the convergence rate of our algorithms to stationary points with (approximately) zero gradient, and refer to π(t) = πθ(t) for short. For PGA, stationarity will be measured using the standard gradient mapping Gη(π(t)) with Gη(π(t)) := 1 η(θ(t) θ(t+1)), the parameter change between iterations [Bec17]. Note that if Θ = Rp and no projection is required, then Gη(π(t)) = f(π(t)) reduces to the gradient magnitude. Computational oracles. As is common in the literature, we analyze computational efficiency in terms of the number of calls to the following oracles, which serve as computational abstractions. We desire a polynomial number of such calls in terms of problem-relevant parameters. Given an i.i.d. dataset D = {(x, y)} and function class F, the maximum likelihood estimation oracle outputs argmaxf F ED[log f(x)]. The squared-loss regression oracle finds argminf F ED[(f(x) y)2]. Both can be approximated efficiently whenever optimizing over F is feasible [MHKL20; FR20]. 3 Online Occupancy-based PG We now develop our occupancy-based policy optimization algorithm for the online RL setting, where the policy can continuously interact with the environment to gather new trajectories. Our gradient estimation routine is based on a recursive Bellman flow-like equation that can be approximately solved using squared-loss regression, not unlike those used to estimate occupancy functions in FORC [HCJ23] or value functions in FQI [ASM07]. The intuitions established for our online algorithm form the foundation for our later offline methods. 3.1 Occupancy-based Policy Gradient The expected return of a policy π can be expressed as the expectation over its occupancy of the per-state rewards, J(π) = P sh dπ h(sh)R(sh). The gradient of J(π) then passes through dπ, sh dπ h(sh)R(sh) = P h Esh dπ h [ log dπ h(sh)R(sh)] . We use the grad-log trick above to write J(π) as an expectation over dπ, which makes it amenable to estimation from online samples as long as we can calculate log dπ h : S Rp. We make the key observation that log dπ h can be expressed as a function of log dπ h 1, which involves a time-reversed conditional expectation over the previous timestep s (sh 1, ah 1) given the current sh. Lemma 3.1. 
For any π and h [H], log dπ h satisfies the recursion log dπ h = Eπ h 1 log π + log dπ h 1 , (1) where [Eπ h 1f](s ) := Eπ[f(sh 1, ah 1)|sh = s ] = P s,a P (s |s,a)π(a|s)dπ h 1(s) dπ h(s ) f(s, a)1, for any function f : S A Rp. Further, under Asm. 2.1, maxs,h log dπ h(s) h G. Eq. (1) is derived by propagating the gradient through the Bellman flow equation, and we can solve it from h = 1 to H to compute log dπ h (with log dπ 0 = 0 by definition). While related observations have been made throughout the rich history of PG literature [CC97; MT01; KU20; XYWL21], the expression in Eq. (1) is adapted to our unique pursuit of modeling log dπ with general function approximators. In particular, the conditional expectation (Eπ) immediately hints that log dπ is amenable to estimation using squared-loss regression, a technique that is well-understood for value functions [SB18] and, more recently, for occupancy functions [HCJ23]. Formally, to solve the dynamic programming equation of Eq. (1) in a computationally efficient manner, we reduce it to minimizing a squared-loss regression problem. Consider the standard (supervised learning) regression setup. The solution to argminf E(x,y) Q[(f(x) y)2] maps x 7 EQ[y|x], the conditional expectation given x of the target y under the joint Q. As a result (see Lem. B.2), log dπ h = argming:S Rp Eπ h g(sh) log π(ah 1|sh 1) + log dπ h 1(sh 1) 2i . (2) 1We use the convention 0/0 = 0 for ratios between two functions. Algorithm 1 OCCUPG: Online Occupancy-based Policy Gradient Input: Samples n; iterations T; policy class ΠΘ; gradient function class {Gh}; learning rate η 1: for t = 0, . . . , T 1 do 2: Collect n trajectories with π(t). Set Dreg h = {(sh, ah, sh+1)}n i=1 for all h. Repeat for {Dgrad h }. 3: Initialize g0 = 0. 4: for h = 1, . . . , H do 5: Let L(t) h 1(gh; gh 1) := 1 (s,a,s ) Dreg h 1 gh(s ) log π(t)(a|s) + gh 1(s) 2. Set bg(t) h = argmingh Gh L(t) h 1(gh; bg(t) h 1). (3) 6: end for 7: Estimate b J(π(t)) = 1 (s,a,s ,r ) Dgrad h 1 bg(t) h (s ) r 8: Update θ(t+1) = ProjΘ θ(t) + η b J(π(t)) . Here, g is a vector-valued function, and the norm 2 is equivalent to the sum of p scalar-valued squared-losses for each parameter dimension. The RHS only requires sampling (sh 1, ah 1, sh) π from online rollouts. Then, given finite samples, we can robustly estimate log dπ by minimizing an empirical version of Eq. (2) using regression oracles. 3.2 Online policy gradient algorithm and analyses Alg. 1 (OCCUPG) displays our full online occupancy-based PG procedure. For each iteration t [T], we first collect two independent datasets: {Dreg h } for log dπ(t) estimation, and {Dgrad h } for J(π(t)) estimation. The former occurs in Line 5, where we recursively solve an empirical version of Eq. (2); the latter is computed in Line 7, then used to update the policy (Line 8). Gradient estimation guarantee. In the following, we establish that our regression-based estimation procedure produces accurate estimates of J(π). Our guarantee holds under the requirement that the gradient function classes {Gh} can express the population gradient update (Lem. 3.1) for any target function. It is analogous to the Bellman completeness assumption that is required for regression-based value or occupancy function estimation [CJ19; HCJ23]. Assumption 3.1 (Gradient function class completeness). For all h [H], supg Gh,s S gh(s) h G. Further, for all π ΠΘ, we have Eπ h 1( log π + gh 1) Gh, for all gh 1 Gh 1. 
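To make the reduction concrete, here is a minimal sketch of the occupancy-gradient estimation loop of Alg. 1 (Lines 3-7) under the simplifying assumption of a linear gradient class G_h = {s ↦ Ψ⊤φ(s)}. The feature map `phi`, the score function `grad_log_pi`, and the dataset layout are hypothetical illustration choices; the closed-form least-squares solve stands in for the squared-loss regression oracle.

```python
import numpy as np

def occupg_gradient_estimate(D_reg, D_grad, grad_log_pi, phi, H, p):
    """Sketch of Lines 3-7 of Alg. 1 (OCCUPG) with a linear gradient class.

    Assumed (illustrative) inputs: D_reg[h-1] holds (s, a, s_next) transitions
    and D_grad[h-1] holds (s, a, s_next, r_next) transitions collected with the
    current policy; grad_log_pi(s, a) returns the p-dimensional policy score;
    phi(s) returns k-dimensional features defining G_h = {s -> Psi.T @ phi(s)}.
    """
    g = [lambda s: np.zeros(p)]                        # grad log d^pi_0 = 0
    for h in range(1, H + 1):
        X = np.array([phi(s_next) for (s, a, s_next) in D_reg[h - 1]])
        Y = np.array([grad_log_pi(s, a) + g[h - 1](s)  # regression target of Eq. (3)
                      for (s, a, s_next) in D_reg[h - 1]])
        Psi, *_ = np.linalg.lstsq(X, Y, rcond=None)    # squared-loss regression oracle
        g.append(lambda s, Psi=Psi: Psi.T @ phi(s))    # bind Psi now: this is g_hat_h

    # Line 7: grad_hat J(pi) = (1/n) sum_h sum_{(s,a,s',r')} g_hat_h(s') * r'
    grad_J = np.zeros(p)
    for h in range(1, H + 1):
        n = len(D_grad[h - 1])
        for (s, a, s_next, r_next) in D_grad[h - 1]:
            grad_J += g[h](s_next) * r_next / n
    return grad_J
```

For a nonlinear class G_h, the closed-form solve would simply be replaced by a call to the regression oracle from Sec. 2; the resulting estimate is then fed to the projected gradient ascent update in Line 8.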
Next, since we allow G to be a continuous function class, our sample complexity bound for gradient estimation is expressed in terms of its pseudodimension := d G (Def. F.1). Examples of G parameterizations and their d G are discussed in Rem. 3.1 below. Finally, Thm. 3.1 shows that OCCUPG produces accurate gradient estimates given the following polynomial sample size. Theorem 3.1. Fix δ (0, 1) and π ΠΘ. Under Asm. 2.1 and Asm. 3.1, we have that w.p. 1 δ, J(π) b J(π) ε when n = e O pd GH6G2 log(1/δ) Remark 3.1. Lastly, we provide examples of G for Asm. 3.1 in representative MDP structures. Lowrank MDPs (Def. B.1) are a well-studied setting where the transition function admits a low-rank decomposition into two features of rank k, i.e., there exists ϕ : S A Rk and µ : S Rk such that P(s |s, a) = ϕ(s, a), µ(s ) [JYWJ20b]. Tabular MDPs are a special case with onehot features. Due to the bilinear transitions, both the occupancy and its gradient are linear functions of µ, i.e., dπ = µ(s) ψ and dπ(s) = µ(s) Ψ for some Ψ Rk p, ψ Rk, and all s S. When µ is known, we can set Gh to be a linear-over-linear function class Gh = n gh(s) = µ(s) Ψ µ(s) ψ : Ψ Rk p, ψ Rk, maxs gh(s) h G o , which has d G = kp (Prop. B.1). Stationary convergence. Next, we analyze the convergence rate of OCCUPG to a stationary policy, i.e., one that has near-zero gradient. Note that, in general, stationary policies are not necessarily optimal as the objective function is non-convex. As is standard in the literature, we will assume that the objective has a smooth gradient [LSAB19; AKLM21]. Assumption 3.2 (β-smooth objective). For a function f : ΠΘ R , there exists β > 0 such that f(πθ) f(πθ ) 2 β θ θ 2 for all θ, θ Θ. Cor. 3.1 shows that, in expectation, OCCUPG with T = O(βH/ε) iterations outputs a ε-stationary point, as measured by Gη(π(t)) = 1 η θ(t) θ(t 1) . The proof relies on Thm. 3.1, i.e., with enough samples the statistical noise of the gradient estimates are sufficiently small to enable convergence. Corollary 3.1. Under Asm. 2.1, Asm. 3.1, and Asm. 3.2, the iterates of OCCUPG with T = O(βHε 1) and n = e O(pd GH6G2 log(T/δ)ε 1) satisfiy 1 T PT t=1 E[ Gη(π(t)) 2] ε. Computational efficiency. OCCUPG is not only statistically efficient but computationally oracleefficient as well, since it reduces to a series of squared-loss minimization problems. In each iteration, it makes H calls to a regression oracle to compute the occupancy gradient (Line 5). Then to converge to a ε-stationary point, from Cor. 3.1 we require a total of O(βH2/ε) such calls. Optimality. Lastly, we analyze when the policies recovered by OCCUPG are also approximately optimal. The key inequality is an upper bound on the suboptimality of any policy in terms of its gradient magnitude (or stationarity), and a coverage coefficient Cπ with respect to the optimal policy. Lemma 3.2. For any π and π , define Bπ(π ) := P h,s,a dπ h(s)π (a|s)Qπ h(s, a). Suppose π ΠΘ, 1. (Policy completeness) There exists π+ ΠΘ such that π+ argmaxπ Bπ(π ). 2. (Gradient domination) maxπ ΠΘ Bπ(π ) Bπ(π) m maxθ Θ Bπ(π), θ θ Given ν (S), define the coverage coefficient Cπ := P h dπ h /ν for π = argmaxπ J(π). Then for any πθ ΠΘ, J(π ) J(πθ) m Cπ max θ ΠΘ Jν(πθ), θ θ , (4) where Jν(π) := Es0 ν,π[P h rh] is the expected return of π in M with initial state distribution ν. The lemma preconditions are identical to those required for value-based analysis [BR24]. 
Bπ(π ) is a one-step improvement objective with respect to the occupancies and value functions of π, and we require (1) the policy class to be expressive enough that it contains any maximizer; and (2) the one-step objective to itself have optimality gap upper-bounded by the one-step policy gradient magnitude, for which the constant m is determined wholly by the policy parameterization. For example, the tabular policy πθ(a|s) = θsa has m = 1 [AKLM21]. The coverage coefficient Cπ is the finite-horizon counterpart to the infinite-horizon exploratory initial distribution salient to the analysis of [AKLM21] and [BR24] (which lists developing it as future work). In RL, a small gradient magnitude alone does not guarantee optimality, as it can also occur when the policy rarely visits rewarding states. The coverage coefficient quantifies both how policy performance can suffer from insufficient exploration, as well as how exploratory initializations mitigates this problem. Finally, combining Lem. 3.2 with the stationary convergence result in Cor. 3.1 shows that, on average, the best-iterate of OCCUPG is near-optimal. Corollary 3.2. Under the preconditions of Lem. 3.2 and Cor. 3.1, running OCCUPG2 with initial distribution ν satisfies E[mint J(π ) J(π(t))] ε when T = e O βB2(Cπ )2m2H2 e O B2(Cπ )2m2pd GH6G2 log(T ) 3.3 Optimization of general functionals One standout feature of OCCUPG is that it can, with a one-line change, be adapted for policy optimization of any (differentiable) objective function involving occupancies. We work with JF (π) = P h Fh(dπ h) as a representative formula, where Fh : (S) R is a general functional. Such objectives often evade value-based PG optimization because they do not admit value functions or Bellman-like recursions with which to compute them. Examples include entropy maximization 2This means that trajectories are generated by first sampling the initial state s1 ν, then rolling out the policy according to the true MDP s dynamics. where Fh(d) = d, log d ; imitation learning where Fh(d) = d dπE h 2 2 for an expert policy πE; and the expected return with Fh(d) = d, R [MDSDBR22]. The policy gradient is then JF (π) = P d(s) |d=dπ h log dπ h(s) i . Implementationwise, we need only change Line 7 in OCCUPG to accommodate the new gradient formula, to b JF (π) = 1 n P s Dh bgπ h(s) Fh(d) d(s) |d= b dπ h . The partial derivative of Fh is evaluated with a plug-in occupancy estimate bdπ that can be obtained using maximum likelihood estimation (App. D). Notably, the occupancy gradient estimation module for bgπ h log dπ h (Line 5) is reused verbatim. Given their resemblance to those in Sec. 3.2, the full algorithm and analyses are deferred to App. B.5. 4 Offline Occupancy-based PG In this section, we develop an algorithm for occupancy-based policy optimization in the offline setting, where only fixed datasets are available for learning. A direct modification of OCCUPG, e.g., by converting occupancies to density ratios over the offline data distribution, will fail unless the data covers all possible policies, otherwise the density ratio may be unbounded. In-line with recent state-of-the-art offline RL algorithms, our goal is to establish an offline PG algorithm that adapts to and retains meaningful guarantees under arbitrary offline datasets, for which our key consideration is establishing an offline gradient estimation method. We begin by defining these offline datasets. Definition 4.1. The offline dataset is D = {Dh}, where Dh = {(sh, ah, sh+1, rh+1))}n i=1 is generated i.i.d. 
as sh d D h for some d D h (S) and ah πD h ( |sh) in M, for a known behavior policy πD h . The marginal next-state distribution in Dh is denoted as d D, h (sh+1). Def. 4.1 is more general than the typical i.i.d. trajectory setting [KU20; NZJZW22], where d D h = d D, h 1. Crucially, unlike previous works that require lower-bounded d D or all-policy coverage [KU20; XYWL21; NZJZW22], we will make no assumptions about the quality of D with respect to ΠΘ. Additional notation. For short, we say EDh[ ] E(sh,ah,sh+1,rh+1) Dh[ ], and use (s, a, s , r ) Dh when clear from the context. For any g : S A Rp and reweighting function ρ : H S A R+, we define an offline reweighted analog to Eπ h (Lem. 3.1) for all h [H] to be [ED,ρ h g](s ) := E(s,a,s ) Dh ρh[g(s, a)|s ] = P s,a [Dh ρh](s,a,s ) P s,a[Dh ρh](s,a,s ) g(s, a). (5) The (time-reversed) conditional expectation is taken over [Dh ρh](s, a, s ) := P(s |s, a)d D h (s)πD h (a|s)ρh(s, a), the joint offline distribution re-weighted by ρh. While this may not be a valid density, its induced conditional distribution on (s, a|s ) always is, i.e., P s,a [Dh ρh](s,a,s ) P s,a[Dh ρh](s,a,s ) = 1. As an example, for a given π we have ED,ρ h = Eπ h when ρh(s, a) = dπ h(s)π(a|s) d D h (s)πD h (a|s) is the policy s density ratio and is well-defined. 4.1 Offline density-based policy gradient A policy s occupancy dπ may not be covered by arbitrary offline data (Def. 4.1), so neither its expected return J(π) = P h dπ h, R nor its gradient J(π) will be estimatable from D. As a result, there is no hope of recovering argmaxπ ΠΘ J(π). Our solution is to instead maximize return only on areas of the state space that are sufficiently covered by offline data, which is captured exactly by the recursively clipped occupancy dπ from [HCJ23]. It clamps the policy occupancy to preset multiples Cs h, Ca h of the offline data distribution, thereby representing only the sufficiently covered portion. Definition 4.2 (Recursively clipped occupancy). Let ( ) := min{ , }. Given clipping constants {Cs h, Ca h} 1, define the clipped policy to be πh = π Ca hπD h , and recursively define dπ h = P πh 1 dπ h 1 Cs h 1d D h 1 , h [H]. (6) Eq. (6) resembles the Bellman flow equation with clipped policy π, and acts on the previous-timestep dπ h 1 clipped to at most Cs h 1d D h 1. Above this threshold the occupancy is considered to be insufficiently covered for estimation, and Cs strikes a bias-variance tradeoff between the amount of clipped mass vs. distribution shift. The clipped occupancy s density ratio is always well-defined and bounded as dπ h/d D, h 1 Cs h 1Ca h 1, and we use it to define our (now learnable) offline objective, sh dπ h(sh)R(sh) = P dπ h(sh) d D, h 1(sh)R(sh) . For any fully covered policy with dπ h Cs h 1Ca h 1d D, h 1 for all h [H], we have dπ = dπ and J(π) = J(π). In this sense, argmaxπ J(π) will be at least as good as the best policy fully covered by offline data. Next, define the density ratio be wπ h := dπ h/d D, h 1. The gradient of J(π) is h EDh 1 wπ h(sh)R(sh) log dπ h(sh) . To calculate this gradient we must compute both wπ and log dπ; for the former, [HCJ23] provides a method that we will later call as a subroutine. Our focus is on computing log dπ h, which is enabled by the following recursive equation, which is an offline analog of Lem. 3.1. Lemma 4.1. For any π and all h [H], define ρπ h(s, a) := ( dπ h(s) Cs hd D h (s)) d D h (s) πh(a|s) πD h (a|s). 
Then log dπ h = ED, ρπ h 1 log π 1[π Ca hπD h 1] + log dπ h 1 1[ dπ h 1 Cs hd D h 1] , (7) where ED, ρπ h 1 is from Eq. (5), and [M v]( ) := v( )M( ) Rp for M : p and v : R. Lem. 4.1 is derived from applying the chain rule to Def. 4.2, and the clipped occupancies play an instrumental role in handling insufficient offline coverage. Notably, the indicator function zeroes-out both the gradients log π and log dπ h 1 where they are insufficiently covered, e.g., dπ h 1(s) > Cs h 1d D h 1(s). Further, under full offline coverage we recover Lem. 3.1 and log dπ = log dπ. Because the rewards are nonnegative, log dπ induces a pessimistic policy gradient that shifts policies away from out-of-distribution actions, even if they generate high return. This is seen more clearly in Prop. 4.1, that rearranges the resulting expression for J(π) into a value-based form: Proposition 4.1. We can equivalently write h EDh[ ρπ h(s, a) log πh(a|s) Qπ h(s, a)], where Qπ is a pessimistic value function that obeys the Bellman-like recursion Qπ h(s, a) = 1[π Ca hπD h ](a|s) P s P(s |s, a) R(s ) + 1[ dπ h+1 Cs h+1d D h+1](s ) Qπ h+1(s , πh+1) . In Qπ, future returns are zeroed out at states and actions that exceed the threshold of data coverage, due to indicators functions that are inherited from log dπ. Prop. 4.1 can be seen as a pessimistic offline analog to the classical PG theorem J(π) = P h Es,a dπ h [ log π(a|s)Qπ h(s, a)] [SMSM99], entirely induced by the definition of the clipped occupancy. Non-robustness of log dπ estimation to plug-in densities. With finite samples, however, it turns out that consistent estimates of log dπ h in Eq. (6) cannot be computed. To make this argument, we first outline the high-level gradient estimation procedure for a fixed policy: Estimate occupancies {bdπ h} and {bd D h } Compute b log dπ h using Eq. (7) with plug-in indicator function estimate 1[bdπ h 1 Cs h bd D h 1] The problem arises in step two, as 1[ ] is a stepwise function and not smooth. Even if bdπ is vanishingly close to dπ, the gradient calculated from plug-in occupancy estimates can have constant error. Proposition 4.2. There exists an MDP and policy π such that, for any ε > 0, maxh,s log dπ h(s) b log dπ h(s) = O(1) when dπ h bdπ h 1 ε and bd D h d D h 1 ε for all h. 4.2 Smooth clipping To resolve this issue, we will use a smooth-clipping function σ (x, c) to approximate the hard - clipping (x c) in Eq. (6), whose non-smooth gradient was the source of our estimation problems. Figure 1 plots 1-D examples of σ (x, c) against (x c) as reference (dashed), and Asm. 4.1 describes the properties of σ that enable our later estimation and convergence guarantees. 1.0 (x, c = 0.5) 1/2 1/4 1/8 1/16 0 I (x, c = 0.5) 2 4 8 16 Figure 1: We plot σ(x, c) from Prop. 4.3 for different b, that trade-off between clipping approximation error and smoothness (Dσ 1/Lσ). Assumption 4.1. Assume that σ satisfies x, x , c, c dom(σ), 1. (Approximate clipping) Dσ 0 such that 0 (x c) σ (x, c) Dσ (x c). 2. (Monotonicity) σ (x , c) σ (x, c) if x x; σ (x, c ) σ (x, c) if c c; and vice versa. 3. (Smooth gradient) Define the smoothed indicator 1 (x, c) := x x log σ (x, c), where x is the partial derivative w.r.t. x. Then 1 (x, c) [0, 1] and Lσ 0 s.t. x, x , c, c dom(σ), c| 1 (x, c) 1 (x , c) | Lσ|x x |, and x| 1 (x, c) 1 (x, c ) | Lσ|c c |. Note that σ (x, c) = (x c) is a special case with 1 (x, c) = 1[x c], thus Dσ = 0 and Lσ = . The following choice of σ, which is plotted in Fig. 1, fulfills Asm. 4.1. Proposition 4.3. 
For any b > 1, σ (x, c) = x b + c b 1/b has Lσ = b and Dσ = 1/b. Next, we define the smooth-clipped occupancy function edπ h, which is no larger than dπ h. Definition 4.3 (Recursively smooth-clipped occupancy). For smooth-clipping function σ satisfying Asm. 4.1 and clipping constants {Cs h, Ca h}, define eπh := σ π, Ca hπD h , and inductively set edπ h = Peπh 1 σ edπ h 1, Cs h 1d D h 1 , h [H]. (8) Then letting ewπ h := edπ h/d D, h 1, our new objective is e J(π) = P h EDh 1[ ewπ h(sh)R(sh)] with gradient e J(π) = P h EDh 1[ ewπ h(sh)R(sh) log edπ h(sh)], where log edπ h obeys the following recursion. Lemma 4.2. For σ satisfying Asm. 4.1, recall 1 (x, c) := x x log σ (x, c) . Then for all h [H], log edπ h = ED,eρπ h 1 log π 1 π, Ca h 1πD h 1 + log edπ h 1 1 edπ h 1, Cs h 1d D h 1 , (9) where eρπ h 1(s, a) := σ( e dπ h 1(s),Cs h 1d D h 1(s)) d D h 1(s) eπh 1(a|s) πD h 1(a|s) and ED,eρπ h 1 is defined in Eq. (5). Further, under Asm. 2.1, maxs,h log edπ h(s) h G. Eq. (9) replaces the (non-smooth) indicator function in log dπ (Lem. 4.1) with its smooth approximation 1, which, as we will show shortly, enables robust gradient estimation with plug-in occupancy estimates. As before, we can reduce it to squared-loss regression (Eq. (11)). Further, by optimizing e J(π), we also approximately maximize our target objective J(π), with bias proportional to Dσ. Proposition 4.4. Under Asm. 4.1, 0 maxπ ΠΘ J(π) maxπ ΠΘ e J(π) H2Dσ. 4.3 Offline smooth-clipped gradient estimation Alg. 2 describes the offline PG algorithm for optimizing e J(π). To reduce clutter, we have used log eπh := log π 1 π, Ca hπD h . First, OFF-OCCUPG estimates d D h 1 using MLE (details in App. D due to space constraints). Then, for each iteration t, it estimates the smooth-clipped occupancy edπ(t) h using FORC (adapted from [HCJ23], see App. E). This is plugged into a squaredloss regression problem approximating Eq. (9) to learn log ed(t) h (lines 8 to 10), then estimate e J(π(t)) (line 12). Algorithm 2 OFF-OCCUPG: Offline Occupancy-based Policy Gradient Input: data D; iters T; learning rate η; function classes ΠΘ, F, W, G; clipping constants {Cs h Ca h} 1: Split D equally into Dmle, DFORC, Dreg, Dgrad, each with n samples. 2: Estimate {bd D h , bd D, h } MLE Dmle, F // Alg. 4 3: for t = 0, . . . , T 1 do 4: Estimate { bw(t) h } FORC π(t), DFORC, W, {bd D h , bd D, h } 5: Set occupancy estimate bd(t) h = bw(t) h bd D, h 1 for all h [H]. 6: Initialize bg(t) 0 = 0. 7: for h = 1, . . . , H do 8: Set density ratio bρ(t) h 1 = eπ(t) h 1 πD h 1 σ b d(t) h 1,Cs h 1 b d D h 1 b d D h 1 . 9: Set gradient regression target by(t) h 1 = bg(t) h 1 1 bd(t) h 1, Cs h 1 bd D h 1 . 10: Let e L(t) h 1(g; y, ρ) := 1 (s,a,s ) Dreg h 1 ρ(s, a) g(s ) ( log eπ(t) h 1(a|s)+y(s)) 2. Solve bg(t) h = argmingh Gh e L(t) h 1(gh; by(t) h 1, bρ(t) h 1) (10) 11: end for 12: Set b e J(π(t)) = 1 (s,a,s ,r ) Dgrad h bw(t) h (s ) bg(t) h (s ) r 13: Update θ(t+1) = Projθ(θ(t) + η b e J(π(t))). 14: end for Before stating the estimation guarantee for e J(π), we first introduce the required assumptions. For simplicity, we assume that the function classes used in MLE and FORC are finite, and defer their guarantees to the respective appendices, as they have been well-established in previous papers [AKKS20; HCJ23]. We focus on discussing Asm. 4.2 for the offline gradient function class, which requires a stronger level of expressiveness. 
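For reference, below is a minimal numerical sketch of the smooth-clipping function from Prop. 4.3 together with its smoothed indicator 1_σ. The closed form 1_σ(x, c) = 1/(1 + (x/c)^b) follows from differentiating log σ, and b = 8 is an arbitrary illustrative choice (recall that D_σ = 1/b and L_σ = b trade off clipping bias against smoothness).

```python
import numpy as np

def smooth_clip(x, c, b=8.0):
    """Prop. 4.3: sigma_b(x, c) = (x^{-b} + c^{-b})^{-1/b}.
    Lower-bounds min(x, c) and approaches it as b grows (D_sigma = 1/b)."""
    return (x ** (-b) + c ** (-b)) ** (-1.0 / b)

def smoothed_indicator(x, c, b=8.0):
    """1_sigma(x, c) = x * d/dx log sigma_b(x, c) = 1 / (1 + (x / c)^b),
    a smooth, [0, 1]-valued surrogate for the hard indicator 1[x <= c]."""
    return 1.0 / (1.0 + (x / c) ** b)
```

In Alg. 2, these two maps replace the hard clip and the indicator functions from Sec. 4.1.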
Since the regression target in OFF-OCCUPG involves plug-in occupancy estimates, the completeness condition naturally requires Gh to express the gradient update in Lem. 4.2 for all possible targets composed of functions from Fh 1, Wh 1, Gh 1. As a result, Asm. 4.2 is generally stronger than Asm. 3.1 for OCCUPG. Assumption 4.2. For all h, supg Gh gh h G; and for all (π, g, f, f , w) ΠΘ Gh 1 Fh Fh 1 Wh 1, we have ED,ρ h 1( log eπh 1 + g 1 wf , Cs h 1f ) Gh, where ρ = σ(wf ,Cs hf) f eπh 1 πD h 1 . When the underlying MDP has favorable structure, however, we can expect that d G is not much larger than was required for OCCUPG. This is indeed the case in low-rank MDPs, where the G defined in Rem. 3.1 also satisfies Asm. 4.2 (proof in Prop. C.1). Due to the bilinear transition structure, the offline gradient update (Lem. 4.2) applied to any target remains a linear-over-linear function. The guarantees for the MLE (Alg. 4) and weight estimation (Alg. 5) subroutines require Asm. D.1 and Asm. E.1, respectively, which are included in the preconditions of the main result below. Briefly, Asm. D.1 requires F to realize the true data distributions d D h and d D, h , which is standard in supervised learning. Asm. E.1 requires W to be closed under the Bellman flow operator, and can be viewed as a 1-dimensional version of Asm. 4.2 where ρ = 1. In this sense both assumptions are weaker requirements on expressivity than that of the gradient class in Asm. 4.2, and more detailed discussions are left to App. D and App. E. Having established its preconditions, we now present our main estimation guarantee for OFFOCCUPG, which pays additional factors for the coverage of offline data (P h Cs h Ca h) and the smoothness of σ. Theorem 4.1. Suppose e J( ) satisfies Asm. 3.2 and fix π ΠΘ. Under Asm. 2.1, Asm. 4.1, Asm. 4.2, Asm. D.1, and Asm. E.1, w.p. 1 δ we have e J(π) b e J(π) ε when n = e O pd GH6G2( P h Cs h Ca h) 2L2 σ log(|W||F|/δ) ε2 Stationary convergence & computational efficiency. Similar to OCCUPG, OFF-OCCUPG with T = O(βH2/ε2) converges to an ε-stationary point. The formal statement is given in Cor. C.1 and is based on the estimation guarantee in Thm. 4.1. As a result, OFF-OCCUPG is also computationally oracle-efficient. Each invocation of MLE involves 2H calls to a likelihood maximization oracle (see Alg. 4), and each invocation of FORC requires H calls to a squared-loss regression oracle (see Alg. 5). Then local convergence is still achieved with O(βH2/ε2) such calls, as increasing T further cannot reduce error from statistical noise (that depends only on the fixed n). Optimality. Analyzing the conditions under which offline PG recovers global optima is more challenging, as we can no longer utilize exploratory initialization (from Cor. 3.2). However, since all occupancies have been clipped to the data distribution, we show in App. C.5 that the offline data itself can sometimes suffice as an exploratory initial distribution, and the corresponding bound is in terms of {Cs h} (instead of the online Cπ ). However, this is not guaranteed in general and our current result only holds under strong all-policy offline data coverage. Briefly, some hardness comes from the fact that clipping causes gradient signals to vanish, so a stationary policy might be far offsupport, rather than optimal. Investigating the possibility of more relaxed conditions for offline PG convergence (or, conversely, refining hardness results) are especially interesting directions for future work. 
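Before moving to the conclusion, and purely for concreteness, here is a minimal sketch of the reweighted regression step at the core of OFF-OCCUPG (Lines 8-10 of Alg. 2) for a single timestep h, again under a hypothetical linear gradient class. The names `phi`, `pi`, `pi_D`, `d_prev`, `dD_prev`, and `g_prev` are illustrative stand-ins for the plug-in estimates produced by MLE and FORC, σ is instantiated as in Prop. 4.3, and the weighted least-squares solve plays the role of the regression oracle.

```python
import numpy as np

def off_occupg_regression_step(D_reg_hm1, g_prev, grad_log_pi, pi, pi_D,
                               d_prev, dD_prev, Cs, Ca, phi, p, b=8.0):
    """Sketch of Lines 8-10 of Alg. 2 for one step h: build the reweighting
    rho_hat and the smooth-clipped regression target, then solve Eq. (10).

    Illustrative (assumed) inputs: D_reg_hm1 holds (s, a, s_next) tuples from
    timestep h-1; pi(a, s) and pi_D(a, s) are action probabilities; d_prev(s)
    and dD_prev(s) are plug-in estimates of d^pi_{h-1} and d^D_{h-1}; g_prev(s)
    is the previous gradient estimate; phi(s) are features for a linear G_h.
    """
    sig = lambda x, c: (x ** (-b) + c ** (-b)) ** (-1.0 / b)   # smooth clip (Prop. 4.3)
    ind = lambda x, c: 1.0 / (1.0 + (x / c) ** b)              # smoothed indicator

    weights, X, Y = [], [], []
    for (s, a, s_next) in D_reg_hm1:
        # Line 8: rho_hat = [sigma(pi, Ca*pi_D)/pi_D] * [sigma(d_prev, Cs*dD_prev)/dD_prev]
        rho = (sig(pi(a, s), Ca * pi_D(a, s)) / pi_D(a, s)
               * sig(d_prev(s), Cs * dD_prev(s)) / dD_prev(s))
        # Line 9 and the Eq. (10) target: grad log pi_tilde + g_prev * 1_sigma(d_prev, Cs*dD_prev)
        target = (grad_log_pi(s, a) * ind(pi(a, s), Ca * pi_D(a, s))
                  + g_prev(s) * ind(d_prev(s), Cs * dD_prev(s)))
        weights.append(np.sqrt(rho))
        X.append(phi(s_next))
        Y.append(target)

    W = np.array(weights)[:, None]
    # Weighted squared-loss regression: solved in closed form for the linear class
    Psi, *_ = np.linalg.lstsq(W * np.array(X), W * np.array(Y), rcond=None)
    return lambda s: Psi.T @ phi(s)                            # g_hat_h
```

Iterating this over h = 1, ..., H and reweighting the rewards by the estimated clipped density ratios (Line 12) then yields the offline gradient estimate used in the projected ascent update.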
5 Conclusion For the first time, we demonstrate how policy optimization can be conducted with (only) occupancy functions for both online and offline RL, and comprehensively analyze both local and global convergence. In the online setting our method directly extends to optimizing general objective functionals that cannot be optimized using value-based methods, and in the offline setting the occupancy-based gradient naturally handles incomplete offline data coverage. As our work is the first in this line of research and theoretical in nature, for future work we plan to launch empirical investigations of our methods, especially those for optimizing general functionals. Additionally, the conditions under which offline PG can converge to global optima is not well-understood, and we hope that our preliminary results here encourage greater interest and investigation into this question. Acknowledgements Nan Jiang acknowledges funding support from NSF IIS-2112471, NSF CAREER IIS-2141781, Google Scholar Award, and Sloan Fellowship. [AFK24] Philip Amortila, Dylan J Foster, and Akshay Krishnamurthy. Scalable Online Exploration via Coverability . In: ar Xiv preprint ar Xiv:2403.06571 (2024). [AKKS20] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps . In: Advances in Neural Information Processing Systems (2020). [AKLM21] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift . In: The Journal of Machine Learning Research 22.1 (2021), pp. 4431 4506. [ASM07] András Antos, Csaba Szepesvári, and Rémi Munos. Fitted Q-iteration in continuous action-space MDPs . In: Advances in neural information processing systems 20 (2007). [Bec17] Amir Beck. First-order methods in optimization. SIAM, 2017. [BFH23] Anas Barakat, Ilyas Fatkhullin, and Niao He. Reinforcement learning with general utilities: Simpler variance reduction and large state-action space . In: International Conference on Machine Learning. PMLR. 2023, pp. 1753 1800. [BR24] Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods . In: Operations Research (2024). [CC97] Xi-Ren Cao and Han-Fu Chen. Perturbation realization, potentials, and sensitivity analysis of Markov processes . In: IEEE Transactions on Automatic Control 42.10 (1997), pp. 1382 1393. [CJ19] Jinglin Chen and Nan Jiang. Information-Theoretic Considerations in Batch Reinforcement Learning . In: International Conference on Machine Learning. 2019. [CJ22] Jinglin Chen and Nan Jiang. Offline reinforcement learning under value and density-ratio realizability: the power of gaps . In: Uncertainty in Artificial Intelligence. PMLR. 2022, pp. 378 388. [DWS12] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic . In: ar Xiv preprint ar Xiv:1205.4839 (2012). [FR20] Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles . In: International Conference on Machine Learning. PMLR. 2020, pp. 3199 3210. [GL16] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming . In: Mathematical Programming 156.1-2 (2016), pp. 59 99. [HCJ23] Audrey Huang, Jinglin Chen, and Nan Jiang. Reinforcement Learning in Low Rank MDPs with Density Features . In: ar Xiv preprint ar Xiv:2302.02252 (2023). [HDGP24] Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. 
On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes . In: Advances in Neural Information Processing Systems 36 (2024). [HJ22a] Audrey Huang and Nan Jiang. Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions . In: Advances in Neural Information Processing Systems. 2022. [HJ22b] Jiawei Huang and Nan Jiang. On the convergence rate of off-policy policy optimization methods with density-ratio correction . In: International Conference on Artificial Intelligence and Statistics. PMLR. 2022, pp. 2658 2705. [HKSVS19] Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration . In: International Conference on Machine Learning. PMLR. 2019, pp. 2681 2691. [HM17] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation . In: International Conference on Machine Learning. PMLR. 2017, pp. 1372 1383. [JKALS17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAClearnable . In: International Conference on Machine Learning. 2017. [JLM21] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman Eluder dimension: New rich classes of RL problems, and sample-efficient algorithms . In: Advances in Neural Information Processing Systems. 2021. [JYWJ20a] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation . In: Conference on Learning Theory. 2020. [JYWJ20b] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation . In: Conference on learning theory. PMLR. 2020, pp. 2137 2143. [KJDC24] Pulkit Katdare, Anant Joshi, and Katherine Driggs-Campbell. Towards Provable Log Density Policy Gradient . In: ar Xiv preprint ar Xiv:2403.01605 (2024). [KU20] Nathan Kallus and Masatoshi Uehara. Statistically efficient off-policy policy gradients . In: International Conference on Machine Learning. PMLR. 2020, pp. 5089 5100. [LLTZ18] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation . In: Advances in neural information processing systems 31 (2018). [LNSJ23] Qinghua Liu, Praneeth Netrapalli, Csaba Szepesvari, and Chi Jin. Optimistic mle: A generic model-based algorithm for partially observable sequential decision making . In: Proceedings of the 55th Annual ACM Symposium on Theory of Computing. 2023, pp. 363 376. [LSAB19] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Offpolicy policy gradient with state distribution correction . In: ar Xiv preprint ar Xiv:1904.08473 (2019). [MBFR24] Zak Mhammedi, Adam Block, Dylan J Foster, and Alexander Rakhlin. Efficient model-free exploration in low-rank mdps . In: Advances in Neural Information Processing Systems 36 (2024). [MDSDBR22] Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, and Marcello Restelli. Challenging common assumptions in convex reinforcement learning . In: Advances in Neural Information Processing Systems 35 (2022), pp. 4489 4502. [MHKL20] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning . In: International conference on machine learning. 2020. [MT01] Peter Marbach and John N Tsitsiklis. Simulation-based optimization of Markov reward processes . 
In: IEEE Transactions on Automatic Control 46.2 (2001), pp. 191 209. [NCDL19] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections . In: Advances in neural information processing systems 32 (2019). [NDKCLS19] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience . In: ar Xiv preprint ar Xiv:1912.02074 (2019). [NZJZW22] Chengzhuo Ni, Ruiqi Zhang, Xiang Ji, Xuezhou Zhang, and Mengdi Wang. Optimal Estimation of Policy Gradient via Double Fitted Iteration . In: International Conference on Machine Learning. PMLR. 2022, pp. 16724 16783. [SB18] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. [SMSM99] Richard S Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation . In: Advances in neural information processing systems 12 (1999). [UHJ20] Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation . In: International Conference on Machine Learning. PMLR. 2020, pp. 9659 9668. [UIJKSX21] Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency . In: ar Xiv preprint ar Xiv:2102.02981 (2021). [XCJMA21] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning . In: Advances in neural information processing systems 34 (2021). [XFBJK22] Tengyang Xie, Dylan J Foster, Yu Bai, Nan Jiang, and Sham M Kakade. The role of coverage in online reinforcement learning . In: ar Xiv preprint ar Xiv:2210.04157 (2022). [XJ21] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability . In: International Conference on Machine Learning. 2021. [XYWL21] Tengyu Xu, Zhuoran Yang, Zhaoran Wang, and Yingbin Liang. Doubly robust off-policy actor-critic: Convergence and optimality . In: International Conference on Machine Learning. PMLR. 2021, pp. 11581 11591. [ZBWK20] Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, and Alec Koppel. Cautious reinforcement learning via distributional risk in the dual domain . In: ar Xiv preprint ar Xiv:2002.12475 (2020). [ZHHJL22] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. Offline reinforcement learning with realizability and single-policy concentrability . In: Conference on Learning Theory. PMLR. 2022, pp. 2730 2775. [ZLKB20] Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration . In: Advances in Neural Information Processing Systems. 2020. A Related work 14 B Additional results and proofs for Sec. 3 15 B.1 Proofs for Sec. 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.2 Proofs for Sec. 3.2 : Estimation and local convergence . . . . . . . . . . . . . . . 16 B.3 Proofs for Sec. 3.2: Global convergence . . . . . . . . . . . . . . . . . . . . . . . 17 B.4 Examples of gradient function class G . . . . . . . . . . . . . . . . . . . . . . . . 19 B.5 Policy optimization of general functionals . . . . . . . . . . . . . . . . . . . . . . 20 C Additional results and proofs for Sec. 4 21 C.1 Proofs for Sec. 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . 21 C.2 Proofs for Sec. 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.3 Proofs for Sec. 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 C.4 Local convergence of OFF-OCCUPG . . . . . . . . . . . . . . . . . . . . . . . . . 30 C.5 Global convergence of OFF-OCCUPG . . . . . . . . . . . . . . . . . . . . . . . . 32 C.6 Proofs for App. C.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 D Maximum Likelihood Estimation 38 E Offline Density Estimation 39 F Probabilistic Tools 41 G Optimization Tools 43 A Related work In this section, we discuss related works in greater detail that concern the convergence and estimation of policy gradient in RL. While a handful of recent papers have similarly observed that the gradient of the log density can be utilized to compute the policy gradient, especially in the context of using it to optimize general functionals, none of them have analyzed methods that are sample-efficient under general function approximation. In particular, [KJDC24] requires on-policy sampling from the time-reversed transition (s, a|s ), which, as they note, is highly restrictive. To overcome this issue they propose a min-max algorithm that converges under linear approximation, which is computationally a far more difficult to solve (under a more stringent structural assumption) than the regression objective in Alg. 1. Similarly, [BFH23] consider only online policy gradient, and to handle large state spaces they use linear function approximation, which may incur a large error through bias in many settings. [ZBWK20] approach the problem of optimizing risk functionals through a primal-dual approach that involves occupancies as dual variables, but they only analyze convergence in tabular settings. A number of works on off-policy gradient optimization utilize all three of the density ratio, value, and policy class functions to compute the gradient [NDKCLS19; HJ22b; UIJKSX21; XYWL21]. This is because the density ratios are required to handle distribution mismatches with offline data. The downside, however, is that even max-min optimization is difficult, so performing optimization over all three functions requires complex optimization loops. By using simply projected gradient ascent on a policy class, our algorithms avoid such complexities and are amenable to classical convergence analysis that allow us to focus on the role of coverage coefficients in our final results. Because the density ratio is generally not well-defined with arbitrary offline data, all of these works require some form of all-policy coverage for both estimation and convergence guarantees. The weight gradient calculation in [UIJKSX21] exhibits a recursive decomposition that is related to ours. However, their formulation is not compatible with our data assumptions and they require policy coverage to be well-defined. A follow-up paper in [XYWL21] uses squared-loss regression on the same updates, which is similar in flavor to our gradient estimation objectives. However, they use linear function approximation for weight functions, which is not realizable in general, and also require all-policy coverage for their convergence results. One close work of comparison is [LSAB19], whose PG algorithm uses learned density ratios to reweight the data distribution and approximate the expression of the policy gradient theorem [SMSM99]. 
To handle coverage issues in offline PG, they zero out portions of the trajectory that exceed data coverage, but only do this for (s, a) such that d D(s)πD(a|s) = 0. This is done (in the infinite horizon setting) by resampling the dataset based on an augmented MDP where such (s, a) transition to an absorbing state However, this does not control the (potentially extremely large) variance of the estimator, e.g., on states where d D(s) 0. Their objective can be seen as a special case of J(π) with finite but extremely large choices of Cs and Ca. They show convergence to a stationary point in terms of weight and value estimation errors that are left implicit, and leave the high variance and coverage issues with offline data implicit. Another is [DWS12], that takes the complementary approach and simply calculates the gradient averaged on d D. However, this is a biased gradient object and does not express the policy gradient of any specific function, which means its stationary point may not even exist, thus precluding convergence analysis. The PSPI algorithm in [XCJMA21] is a policy optimization algorithm based on pessimistic value functions. Their setting is somewhat orthogonal to ours in the sense that they study values and we study occupancies, and we note that they do not perform policy optimization with respect to a standalone policy class but rather an implicit one induced by the value functions, which an be extremely large. In the value function sphere, [NZJZW22] leverage the linear structure of linear MDPS to develop closed-form gradient estimators through the value functions. They largely only analyze estimation errors and additionally require a form of all-policy coverage for their results. Lastly, our optimality analysis builds off the results in [BR24] and [AKLM21] that analyze global optimality in the (infinite horizon) online setting. B Additional results and proofs for Sec. 3 B.1 Proofs for Sec. 3.1 Proof of Lem. 3.1 First we expand dπ h using the Bellman flow equation, dπ h(s ) = P s,a P(s |s, a)π(a|s)dπ h 1(s): dπ h(s ) = P s,a P(s |s, a)( π(a|s)dπ h 1(s) + π(a|s) dπ h 1(s)) s,a P(s |s, a)π(a|s)dπ h 1(s)( log π(a|s) + log dπ h 1(s)). In the last line above we use the grad-log trick. Note that log dπ(s) is not well-defined when dπ(s) = 0, but the two terms will cancel out in the above expression for this case. From Bayes theorem, Pπ(sh 1 = s, ah 1 = a|sh = s ) = P(s |s, a)π(a|s)dπ h 1(s)/dπ h(s ), thus log dπ h(s ) = dπ h(s )/dπ h(s ) = Eπ[ log π(a|s) + log dπ h 1(s)|sh = s ] = [Eπ h 1( log π + log dπ h 1)](s ). We use the convention that 0/0 = 0, thus log dπ h is always well-defined. Lastly, the second statement results from Lem. B.1, which shows that log dπ h(s ) is always bounded and well-defined under Asm. 2.1. Lemma B.1. Under Asm. 2.1, we have maxs,h log dπ h(s) h G. Proof. The lemma statement can be derived inductively starting from the observation that Eq. (1) is an expectation over its target functions. As a result, the maximum gradient magnitude should accrue additively over horizons. More concretely, fix h and s. Then log edπ h(s ) = Eπ[ log π(a|s) + log dπ h 1(s)|sh = s ] log π(a|s) + log dπ h 1(s) G + log dπ h 1(s) , using Asm. 2.1 in the last line. Since log dπ 0 = 0 by definition, unrolling the above recursion through timesteps gives the stated result. Lemma B.2. For g : S Rp and f : S A Rp, define the squared loss Lh(g; f, π) = Eπ[ g(sh+1) f(sh, ah) 2]. 
Then for any such f, Eπ h(f) = argmin g:S Rp Lh(g; f, π) and log dπ h+1 = argmin g:S Rp Lh (g; log π + log dπ h, π) . Proof of Lem. B.2. Since the objective is convex, can solve for the minimizer in closed form by taking the derivative and setting it to 0 in an element-wise manner. Fix s . Taking the gradient of Lh(g; f, π) with respect to g(s ), we have that 0 = dπ h(s )g(s ) X s,a P(s |s, a)π(a|s)dπ h 1(s)f(s, a). Rearranging and using the definition of Eπ h gives the result. The second statement follows from Lem. 3.1. B.2 Proofs for Sec. 3.2 : Estimation and local convergence Proof of Thm. 3.1 First we split up the errors contributed by regression and the estimation. Fix π, then EDreg[b J(π)] = P h Es dπ h[bgπ h(s)Rh(s)] and J(π) b J(π) J(π) EDreg[b J(π)] + EDreg[b J(π)] b J(π) The first term is related to the regression error in bgπ h approximating log dπ h, J(π) EDreg[b J(π)] = P h Edπ h[ log dπ h(s)Rh(s)] Edπ h[bgπ h(s)Rh(s)] h Edπ h[ log dπ h(s) bgπ h(s)] = P h q Pp p=1 p log dπ h [bgπ h]p 2 1,dπ h. For a fixed h and p, we recursively decompose p log dπ h [bgπ h]p 1,dπ h p log dπ h [Eπ h 1( log π + bgπ h 1)]p 1,dπ h + [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 1,dπ h p log dπ h 1 [bgπ h 1]p 1,dπ h 1 + [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 2,dπ h, using the fact that log dπ h = Eπ h 1( log π + log dπ h 1) in the second line. Then unrolling the recursion, we have p log dπ h [bgπ h]p 1,dπ h X h [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 2,dπ h Applying Lem. F.2 (more exactly, this is an offline version but we invoke it with ρ = 1 and no clipping for the online setting) with δ = δ/2Hp and a union bound over all h and p, we have [Eπ h 1( log π + bgπ h 1)]p [bgπ h]p 2 2,dπ h = E[Lp Dreg h 1(bgπ h, bgπ h 1) Lp Dreg h 1(Eπ h 1( log π + bgπ h 1), bgπ h 1)] 2(εreg h 1)2, where εreg h 1 = q cd Gh2G2 log(2Hp/δ) n . Then for any h, p we have p log dπ h [bgπ h]p 1,dπ h 2cd Gh4G2 log(2Hp/δ) J(π) EDreg[b J(π)] p H p log dπ H [bgπ H]p 1,dπ H = 2cpd GH6G2 log(2Hp/δ) For the second term, |EDreg[b p J(π)] b p J(π)| X h |Es dπ h[bgπ h(s)Rh(s)] 1 s Dh bgπ h(s)Rh(s)| H2G log(2p H/δ) where we use Hoeffding s inequality with union bound, for all h [H] and p p in the last line, given that the randomness of bg is independent given Dreg h . Thus EDreg[b J(π)] b J(π) H2G p log(2p H/δ) n Combining the two terms, our final bound is J(π) b J(π) pd GH6G2 log(2Hp/δ) with the regression error dominating. Proof of Cor. 3.1 For any fixed run of Alg. 1, calling Thm. 3.1 for π(t) with δ = δ/T and taking a union bound over T gives with probability at least 1 δ that J(π(t)) b J(π(t)) pd GH6G2 log(2Hp T/δ) Then setting δ = 1/ n, we have E h J(π(t)) b J(π(t)) i pd GH6G2 log(2Hp Tn) where the expectation is over random samples in Dreg, Dest. Finally, plugging this into the PGD stationary convergence bound in Lem. G.1 gives t E h Gη(π(t), J(π(t))) 2i 4βH T + 6pd GH6G2 log(2Hp Tn) Setting the RHS to ε and setting T, n appropriately gives the result. B.3 Proofs for Sec. 3.2: Global convergence We will establish the conditions under which J(π) satisfies a gradient domination property, meaning that for any θ Θ, the suboptimality of πθ is bounded by some function S that includes a measure of its stationarity, i.e., maxπ ΠΘ e J(π ) J(πθ) S( e J(πθ)). This combined with the sample complexity bounds for stationary convergence established in Cor. 3.1 enables our global optimality result in Cor. 3.2. Though we are concerned with optimizing J(π) induced by running π starting from initial distribution d0, it will be useful to consider performing Alg. 
1 using a different exploratory initial distribution µ (S). By exploratory, we mean that we allow µ(s) > 0 for all s S, unlike d0 (S0). In the (stationary) infinite horizon this is a common trick for obtaining well-defined gradient domination bounds [AKLM21], but its finite-horizon (nonstationary) counterpart is nontrivial and to our knowledge has not previously been formalized (it is listed as future work in [BR24]). We state and prove a more general version of Lem. 3.2: Lemma B.3. For any π and π , define Bπ(π ) := P h,s,a dπ h(s)π (a|s)Qπ h(s, a). Suppose π ΠΘ, 1. (Policy completeness) There exists π+ ΠΘ such that π+ argmaxπ Bπ(π ). 2. (Gradient domination) maxπ ΠΘ Bπ(π ) Bπ(π) m maxθ Θ Bπ(π), θ θ Given ν (S), define the coverage coefficient Cπ := P h dπ h /ν for π = argmaxπ J(π). Then for any πθ ΠΘ, J(π ) J(πθ) m max θ ΠΘ Jµ(πθ), θ θ Cπ max θ ΠΘ Jν(πθ), θ θ , where Jν(π) := Es0 ν,π[P h rh] is the expected return of π in M with initial state distribution ν. Proof of Lem. B.3 First we note two facts that hold regardless of M. We have Qπ g (sh, ah) = Qπ h(sh, ah) for any g h, and dπ g (sh) = 0 if g > h. J(π ) J(πθ) = s,a dπ h (s) (π (a|s) πθ(a|s)) Qπθ h (s, a) Then we will write Qπ(s, a) Qπ h(s, a), thus J(π ) J(πθ) = h=0 dπ h (s) (π (a|s) πθ(a|s)) Qπθ(s, a) (π (a|s) πθ(a|s)) Qπθ(s, a) ! π+(a|s) πθ(a|s) Qπθ(s, a) h dπ h (s) P h dπθ µ,h(s) h dπθ µ,h(s) ! π+(a|s) πθ(a|s) Qπθ(s, a) h dπθ µ,h(s) ! π+(a|s) πθ(a|s) Qπθ(s, a) For the RHS, observe that dπθ µ,g(sh) = 0 for g > h. Then X s,a dπθ µ,h(s) π+(a|s) πθ(a|s) Qπθ(s, a) s,a dπθ µ,h(s) π+(a|s) πθ(a|s) Qπθ h (s, a) s,a dπθ µ,h(s) π+(a|s) πθ(a|s) Qπθ µ,h(s, a) = Bπθ(π+) Bπθ(πθ) m max θ Θ Bπθ(πθ), θ θ = m max θ Θ Jµ(πθ), θ θ Combining the two inequalities results in the final guarantee. Proof of Cor. 3.2 Fix {π(t)}t [T ] from Alg. 1. Then for any t [T], from Lem. 3.2 we have J(π ) J(π(t)) m Cπ max θ ΠΘ D J(π(t)), θ θ E Bm Cπ Gη(π(t)) (Lem. G.4) Then summing through T and taking an expectation over the randomness in the algorithm, we have t J(π ) J(π(t)) T + 6pd GH6G2 log(2Hp Tn) . (Cor. 3.1) B.4 Examples of gradient function class G This section contains formal statements of the claims in Rem. 3.1, and their proofs. We begin by defining the low-rank MDP, noting that for notational compactness we have dropped the features h-dependence given our assumption that there is a one-to-one correspondence between states and the timestep at which they are visited. Definition B.1 (Low-rank MDP). We say M is a low-rank MDP with dimension k if h [H], there exists ϕ : S A Rk and µh : S Rk such that (s, a, s ), we have P(s |s, a) = ϕ(x, a), µ(x ) . Further, ϕ Cϕ and P Prop. B.1 shows that, in low-rank MDPs, a linear-over-linear parameterization for the gradient function class satisfies the completeness requirement in Asm. 3.1, with pseudo-dimension linear in the low-rank dimension and the parameter dimension, i.e., d Gh = O(kp). Proposition B.1. Suppose M is a low-rank MDP (Def. B.1), and suppose µ is known. For each layer h, define the function class Gh = gh = µ Ψ µ ψ : Ψ Rk p, ψ Rk, gh h G, h [H] . Then {Gh} satisfies Asm. 3.1 and has pseudodimension (Def. F.1) d Gh = O(kp). Proof of Prop. B.1. It suffices to show that, for any function f : S A Rp and policy π, its gradient update from Lem. 3.1 is Eπ h( log π + f) Gh+1. Since [Eπ h( log π + f)](s ) = Eπ[ log π(s, a) + f(s)|s ], from Bayes rule and the definition of the Bellman flow operator (see proof of Lem. 
Algorithm 3 Online Occupancy-based PG for General Functionals
Input: functional $F = \{F_h\}$; samples $n$; iterations $T$; policy class $\Pi_\Theta$; gradient function class $\mathcal G$; learning rate $\eta$; density function class $\mathcal F$.
1: for $t = 0, \ldots, T-1$ do
2:   For all $h\in[H]$, deploy $\pi^{(t)}$ for $3n$ trajectories. Set $D^{\mathrm{reg}}_h = \{(s_h, a_h, s_{h+1})\}_{i=1}^n$, and similarly for $D^{\mathrm{est}}_h$ and $D^{\mathrm{mle}}_h$.
3:   for $h = 1, \ldots, H-1$ do
4:     Define $L^{(t)}_{h-1}(g_h, g_{h-1}) := \frac 1 n \sum_{(s,a,s')\in D^{\mathrm{reg}}_{h-1}} \big\| g_h(s') - \nabla\log\pi^{(t)}(a|s) - g_{h-1}(s) \big\|^2$, and set $\widehat g^{(t)}_h = \operatorname{argmin}_{g_h\in\mathcal G_h} L^{(t)}_{h-1}(g_h, \widehat g^{(t)}_{h-1})$.
5:   end for
6:   Estimate $\widehat d^\pi_h \leftarrow \mathrm{MLE}(D^{\mathrm{mle}}, \mathcal F)$.  // Alg. 4
7:   Estimate $\widehat\nabla J_F(\pi) = \sum_h \frac 1 n \sum_{s\in D^{\mathrm{est}}_h} \widehat g^\pi_h(s)\, \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h}$.
8:   Update $\theta^{(t+1)} = \mathrm{Proj}_\Theta\big( \theta^{(t)} + \eta\, \widehat\nabla J_F(\pi^{(t)}) \big)$.
9: end for

B.5 Policy optimization of general functionals

Alg. 3 displays the full algorithm for the optimization of general functionals (described in Sec. 3.3). It shares its occupancy-gradient estimation module with OCCUPG. Compared to Alg. 1, the only change is the objective-gradient calculation in Line 7, which uses a plug-in estimate of the occupancy (Line 6) to evaluate the partial derivative. Since the algorithmic change is small, the analysis of Alg. 3 requires only a few adaptations from the analysis of OCCUPG. For smooth and differentiable functionals, we provide the gradient estimation guarantee below. The smoothness ensures that using plug-in occupancy estimates to evaluate the partial derivative leads to consistent gradient estimates, and is in line with the spirit of the standard objective smoothness requirement (Asm. 3.2).

Assumption B.1. Suppose that for all $h$, $F_h$ has a smooth gradient, i.e., for any $f, f'\in\Delta(\mathcal S)$,
\[
\Big\| \frac{\partial F_h(d)}{\partial d}\Big|_{d=f} - \frac{\partial F_h(d)}{\partial d}\Big|_{d=f'} \Big\|_\infty \le L_F\, \| f - f' \|_1,
\]
and has bounded range $|F_h(d)| \le C_F$.

Theorem B.1. Suppose that Asm. 2.1 and Asm. B.1 hold, and fix $\pi\in\Pi_\Theta$. With probability at least $1-\delta$,
\[
\big\| \nabla J_F(\pi) - \widehat\nabla J_F(\pi) \big\| \lesssim H^2 G L_F \sqrt{\frac{p \log(2pH|\mathcal F|/\delta)}{n}} + \sqrt{\frac{p\, d_{\mathcal G}\, H^6 G^2 \log(2Hp/\delta)}{n}}.
\]
When Asm. 3.2 holds, this result directly leads to a stationary convergence guarantee similar to Cor. 3.1, by union bounding Thm. B.1 over all $T$ iterations and then plugging it into Lem. G.5 (see the proof of Cor. 3.1). We expect that the global convergence result in Cor. 3.2 can also be extended with little overhead when $\{F_h\}$ are convex, but we leave a full investigation to future work.
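The following is a minimal sketch of the one-line change in Line 7 of Alg. 3, shown for an entropy-maximization functional $F_h(d) = -\sum_s d(s)\log d(s)$, whose partial derivative is $\partial F_h(d)/\partial d(s) = -\log d(s) - 1$. The entropy choice, the variable names, and the data layout are our illustrative assumptions.

```python
# Sketch: plug-in functional gradient (Alg. 3, Line 7) -- the reward R(s) is
# replaced by dF_h/dd(s) evaluated at an MLE occupancy estimate d_hat.
import numpy as np

def entropy_partial(d_hat_h, s):
    """dF/dd(s) for F(d) = -sum_s d(s) log d(s)."""
    return -np.log(d_hat_h[s]) - 1.0

def functional_gradient(g_hat, d_hat, D_est):
    """g_hat[h]: array (S, p) of estimated occupancy gradients;
    d_hat[h]: MLE occupancy estimate over S; D_est[h]: sampled states."""
    p = g_hat[0].shape[1]
    grad = np.zeros(p)
    for h in range(len(D_est)):
        for s in D_est[h]:
            grad += g_hat[h][s] * entropy_partial(d_hat[h], s) / len(D_est[h])
    return grad
```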
Proof of Thm. B.1. The analysis follows largely the same lines as the proof of Thm. 3.1; however, we must additionally account for the error of approximating $\frac{\partial F_h(d)}{\partial d(s)}\big|_{d=\widehat d^\pi_h}$ with the plug-in occupancy estimate. This was unnecessary for the expected return in Sec. 3.2, since $\frac{\partial F_h(d)}{\partial d(s)} = R_h(s)$ is independent of the occupancy. First, for all $h\in[H]$, with probability at least $1-\delta$ the occupancy estimates from Alg. 4 satisfy
\[
\| d^\pi_h - \widehat d^\pi_h \|_1 \le \sqrt{\frac{2\log(H|\mathcal F|/\delta)}{n}} =: \varepsilon_{\mathrm{mle}}, \qquad \forall h\in[H].
\]
This follows directly from Lem. D.1 with a union bound over $H$. Next, we isolate the occupancy-estimation error from the error we would like to bound. Define
\[
\bar\nabla J_F(\pi) := \sum_h \mathbb E_{d^\pi_h}\Big[ \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h}\, \nabla\log d^\pi_h(s) \Big],
\]
and decompose
\[
\big\| \nabla J_F(\pi) - \widehat\nabla J_F(\pi) \big\| \le \big\| \nabla J_F(\pi) - \bar\nabla J_F(\pi) \big\| + \big\| \bar\nabla J_F(\pi) - \widehat\nabla J_F(\pi) \big\|.
\]
For the first term,
\begin{align*}
\big\| \nabla J_F(\pi) - \bar\nabla J_F(\pi) \big\|
&\le \sum_h \Big\| \mathbb E_{d^\pi_h}\Big[ \Big( \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=d^\pi_h} - \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h} \Big)\, \nabla\log d^\pi_h(s) \Big] \Big\| \\
&\le \sum_h \mathbb E_{d^\pi_h}\Big[ \Big| \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=d^\pi_h} - \frac{\partial F_h(d)}{\partial d(s)}\Big|_{d=\widehat d^\pi_h} \Big|\; \big\| \nabla\log d^\pi_h(s) \big\| \Big] \\
&\le H^2 G L_F \max_h \| d^\pi_h - \widehat d^\pi_h \|_1 \le H^2 G L_F\, \varepsilon_{\mathrm{mle}},
\end{align*}
using Asm. B.1 in the second-to-last inequality. This takes care of the aforementioned occupancy-estimation error. Conditioned on such $\{\widehat d^\pi_h\}$, the second term $\| \bar\nabla J_F(\pi) - \widehat\nabla J_F(\pi) \|$ is analogous to the error bounded in Thm. 3.1, and the proof follows identically from there, but with dependence on the range $C_F$ of the functionals.

C Additional results and proofs for Sec. 4

C.1 Proofs for Sec. 4.1

Proof of Lem. 4.1. Passing the gradient through the clipped Bellman flow equation in Def. 4.2, we have
\begin{align*}
\nabla \bar d^\pi_h(s') &= \sum_{s,a} P(s'|s,a)\Big[ \nabla\bar\pi_{h-1}(a|s)\,\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big) + \bar\pi_{h-1}(a|s)\,\nabla\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big) \Big] \\
&= \sum_{s,a} P(s'|s,a)\,\bar\pi_{h-1}(a|s)\,\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big)\Big[ \nabla\log\bar\pi_{h-1}(a|s) + \nabla\log\big( \bar d^\pi_{h-1}(s)\wedge C^s_{h-1}d^D_{h-1}(s) \big) \Big].
\end{align*}
Next, dropping the $h-1$ subscript for a moment, observe that
\[
\nabla\log\big( \bar d^\pi(s)\wedge C^s d^D(s) \big) = \begin{cases} \nabla\log \bar d^\pi(s), & \text{if } \bar d^\pi(s) < C^s d^D(s), \\ 0, & \text{if } \bar d^\pi(s) > C^s d^D(s), \end{cases}
\]
with a discontinuity at $\bar d^\pi(s) = C^s d^D(s)$. For simplicity, we set $\nabla\log(\bar d^\pi(s)\wedge C^s d^D(s)) = \nabla\log\bar d^\pi(s)\,\mathbb 1[\bar d^\pi(s)\le C^s d^D(s)]$. Similarly, $\nabla\log\bar\pi(a|s) = \nabla\log\pi(a|s)\,\mathbb 1[\pi(a|s)\le C^a\pi^D(a|s)]$. Finally, $\nabla\log\bar d^\pi_h(s') = \nabla\bar d^\pi_h(s')/\bar d^\pi_h(s')$, where $\bar d^\pi_h(s') = \sum_{s,a}P(s'|s,a)\,\pi^D_{h-1}(a|s)\,d^D_{h-1}(s)\,\bar\rho^\pi_{h-1}(s,a)$. The lemma statement follows from the definition of $E^{D,\bar\rho}_{h-1}$, and the gradient magnitude bound follows from invoking Lem. C.2 with $\sigma(x,c) = x\wedge c$.

Proof of Prop. 4.1. This result follows from applying Lem. C.7, a more general version of the proposition statement that holds for any (smooth-)clipping function, to $\sigma(x,c) = x\wedge c$.

Proof of Prop. 4.2. The MDP we describe corresponds to a multi-armed bandit with two actions. Consider an MDP with $H = 2$, $\mathcal S_0 = \{s_0\}$, $\mathcal S_1 = \{s_L, s_R\}$, and $\mathcal S_2 = \{s_-, s_+\}$, which are terminal. In every state there are two actions, $\mathcal A = \{L, R\}$, with deterministic transitions. At the first level, $s_0 \xrightarrow{L} s_L$ and $s_0 \xrightarrow{R} s_R$. At the second level, $s_L \to s_-$ and $s_R \to s_+$, regardless of the action taken. For the reward function, $R(s_+) = 1$ and $R = 0$ otherwise. The policy is parameterized by a single parameter $\theta$ such that $\pi(L) = 1-\theta$ and $\pi(R) = \theta$, so that $d^{\pi_\theta}_1(s_R) = d^{\pi_\theta}_2(s_+) = \theta$. Further, both the offline data and the behavior policy are uniform at each level; consequently, $\pi^D_0(L) = \pi^D_0(R) = \tfrac 1 2$ and $d^D_1 = \mathrm{unif}(\mathcal S_1)$. We set $C^s_1 = C^s_2 = 2$ and $C^a_2 = 2$, so that $\bar\pi_h = \pi_h$ for all $h$.

Fix $\theta$ and estimated occupancies $\widehat d^{\pi_\theta}$ and $\widehat d^D$. For any $s'\in\mathcal S_2$ we have
\[
\nabla\log\bar d^{\pi_\theta}_2(s') - \widehat\nabla\log\bar d^{\pi_\theta}_2(s') = \nabla d^{\pi_\theta}_1(s')\,\Big( \mathbb 1\big[\widehat d^{\pi_\theta}_1(s') \le \widehat d^D_1(s')\big] - \mathbb 1\big[d^{\pi_\theta}_1(s') \le d^D_1(s')\big] \Big).
\]
Next, we instantiate $\widehat d^{\pi_\theta}$ and $\widehat d^D$ for any $\pi_\theta$. The preconditions of the proposition are satisfied by an estimated occupancy with $\widehat d^{\pi_\theta}_1(s_L) = \theta + \epsilon/2$ and $\widehat d^{\pi_\theta}_1(s_R) = \theta - \epsilon/2$, together with an estimate $\widehat d^D$ with $\widehat d^D_1(s_L) = 1/2 - \epsilon/2$ and $\widehat d^D_1(s_R) = 1/2 + \epsilon/2$. We will consider $\theta = 1/2$, so that $d^{\pi_\theta}_1 \le C^s_1 d^D_1$; however, $\widehat d^{\pi_\theta}_1(s_L) > \widehat d^D_1(s_L)$. As a result,
\[
\big\| \nabla\log\bar d^{\pi_\theta}_2(s') - \widehat\nabla\log\bar d^{\pi_\theta}_2(s') \big\| = \big| \nabla d^{\pi_\theta}_1(s') \big| = O(1).
\]
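To make the indicator mismatch in the Prop. 4.2 example concrete, the following toy computation (with $\theta = 1/2$ and $\epsilon = 0.1$; variable names are ours) compares the clipping indicators under the true and estimated densities. Whenever they disagree, the estimated gradient of the clipped occupancy is off by $|\nabla_\theta d^{\pi_\theta}_1(s)| = 1$, i.e., an $O(1)$ error.

```python
# Toy numerical check of the indicator mismatch in the Prop. 4.2 example.
theta, eps = 0.5, 0.1

d_true = {"sL": 1 - theta, "sR": theta}            # d^{pi_theta}_1
dD_true = {"sL": 0.5, "sR": 0.5}                   # d^D_1
d_hat = {"sL": theta + eps / 2, "sR": theta - eps / 2}
dD_hat = {"sL": 0.5 - eps / 2, "sR": 0.5 + eps / 2}

for s in ("sL", "sR"):
    true_ind = float(d_true[s] <= dD_true[s])      # indicator with true densities
    est_ind = float(d_hat[s] <= dD_hat[s])         # indicator with estimates
    # |grad_theta d^{pi_theta}_1(s)| = 1, so the gradient error is O(1)
    # whenever the two indicators disagree (here: at sL).
    print(s, true_ind, est_ind, abs(true_ind - est_ind))
```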
C.2 Proofs for Sec. 4.2

First, we formally state and prove the claim that Lem. 4.2 can be reduced to minimizing a squared-loss regression problem recursively over timesteps, i.e.,
\[
\nabla\log\widetilde d^\pi_{h+1} = \operatorname*{argmin}_{g:\mathcal S\to\mathbb R^p} \mathbb E_{D_h}\Big[ \widetilde\rho^\pi_h(s,a)\, \big\| g(s') - \nabla\log\pi\cdot\bar\partial_1\big(\pi, C^a_h\pi^D_h\big) - \nabla\log\widetilde d^\pi_h\cdot\bar\partial_1\big(\widetilde d^\pi_h, C^s_h d^D_h\big) \big\|^2 \Big]. \tag{11}
\]
This is a reweighted offline analog of Eq. (2) from the online setting, and a more general version is presented below with proof.

Lemma C.1. For $g:\mathcal S\to\mathbb R^p$, $f:\mathcal S\times\mathcal A\to\mathbb R^p$, and a reweighting function $\rho:\mathcal S\times\mathcal A\to\mathbb R_+$, define the offline reweighted squared-loss regression objective
\[
\widetilde L_h(g; f, \rho) = \mathbb E_{D_h}\big[ \rho(s_h, a_h)\, \| g(s_{h+1}) - f(s_h, a_h) \|^2 \big].
\]
Then for any such $f$, $E^{D,\rho}_h(f) = \operatorname{argmin}_{g:\mathcal S\to\mathbb R^p} \widetilde L_h(g; f, \rho)$. Further, for the smooth-clipped density ratio
\[
\widetilde\rho^\pi_h(s,a) = \frac{\sigma\big( \widetilde d^\pi_h(s), C^s_h d^D_h(s) \big)}{d^D_h(s)} \cdot \frac{\widetilde\pi_h(a|s)}{\pi^D_h(a|s)}
\]
and the smooth-clipped target function
\[
y^\pi_h := \nabla\log\pi\cdot\bar\partial_1\big( \pi, C^a_h\pi^D_h \big) + \nabla\log\widetilde d^\pi_h\cdot\bar\partial_1\big( \widetilde d^\pi_h, C^s_h d^D_h \big)
\]
from Lem. 4.2, we have $\nabla\log\widetilde d^\pi_{h+1} = \operatorname{argmin}_{g:\mathcal S\to\mathbb R^p} \widetilde L_h(g; y^\pi_h, \widetilde\rho^\pi_h)$.

Proof of Lem. C.1. Since the objective is convex, we can solve for the minimizer in closed form by taking the derivative and setting it to zero element-wise. For each $s'$,
\[
0 = \Big( \sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a) \Big)\, g(s') - \sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)\, f(s,a).
\]
Rearranging and using the definition of $E^{D,\rho}_h$ (Eq. (5)) gives the result. The second statement follows from Lem. 4.2.

Proof of Prop. 4.3. Part 1 follows from the gradient formula
\[
\partial_x \sigma(x,c) = x^{-\beta-1}\big( x^{-\beta} + c^{-\beta} \big)^{-1/\beta - 1} = \big( 1 + x^\beta c^{-\beta} \big)^{-1/\beta - 1}.
\]
It can be seen that $\partial_x\sigma(x,c)\in[0,1]$ and is non-increasing in $x$, thus $\sigma$ is monotonic. Additionally, $|\sigma(x,c) - \sigma(x',c)| \le |x - x'|$; since $\sigma$ is symmetric in its arguments, we also have $|\sigma(x,c) - \sigma(x,c')| \le |c - c'|$.

Next, we prove Part 2. Let $z = x\wedge c$, and observe that $z - \sigma(x,c) \le z - \sigma(z,z)$ since $\sigma$ is monotonic. Further,
\[
z - \sigma(z,z) = z - \big( 2 z^{-\beta} \big)^{-1/\beta} = z\big( 1 - 2^{-1/\beta} \big) \le z\big( 1 - e^{-1/\beta} \big) \le \frac z \beta.
\]
Rearranging and plugging in the expression for $z$ gives the result.

Part 3 can be derived algebraically (but not easily), and is best intuited from the plot of the maximum slope $\sup_{x,x',c\in[0,1]} |\bar\partial_1(x,c) - \bar\partial_1(x',c)|/|x-x'|$ in Figure 2, which corresponds to $L_\sigma/c$ in the RHS of the bound. The left panel shows that the maximum slope increases linearly in $\beta$, and the right panel shows that it increases inversely with $c$. The dashed red line is a conjectured constant, $L_\sigma/c = 0.3\beta/c$, that upper bounds the maximum slope. In particular, $L_\sigma = O(\beta)$.

Figure 2: The $y$-axis plots the maximum slope $\sup_{x,x'} |\bar\partial_1(x,c) - \bar\partial_1(x',c)|/|x-x'| = L_\sigma/c$. Left panel: varying $\beta$ with fixed $c = 0.5$; right panel: varying $c$ with fixed $\beta = 4$.

Proof of Lem. 4.2. Using the chain rule,
\begin{align*}
\nabla\widetilde d^\pi_h(s') &= \sum_{s,a} P(s'|s,a)\Big[ \nabla\widetilde\pi_{h-1}(a|s)\,\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) + \widetilde\pi_{h-1}(a|s)\,\nabla\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \Big] \\
&= \sum_{s,a} P(s'|s,a)\,\widetilde\pi_{h-1}(a|s)\,\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)\Big[ \nabla\log\widetilde\pi_{h-1}(a|s) + \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \Big] \\
&= \sum_{s,a} P(s'|s,a)\,\pi^D_{h-1}(a|s)\, d^D_{h-1}(s)\,\widetilde\rho^\pi_{h-1}(s,a)\Big[ \nabla\log\widetilde\pi_{h-1}(a|s) + \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \Big],
\end{align*}
where in the last line we use the definition of $\widetilde\rho^\pi_{h-1}$ from Lem. 4.2 to make a change of measure. Further,
\begin{align*}
\nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)
&= \frac{\nabla\widetilde d^\pi_{h-1}(s)\,\partial_x\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)}{\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)} \\
&= \nabla\log\widetilde d^\pi_{h-1}(s)\cdot\frac{\widetilde d^\pi_{h-1}(s)\,\partial_x\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)}{\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big)} = \nabla\log\widetilde d^\pi_{h-1}(s)\cdot\bar\partial_1\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big),
\end{align*}
by definition. The analogous statement holds for $\nabla\log\widetilde\pi_{h-1}$. Next, using the same change of measure in Def. 4.3, we have
\[
\widetilde d^\pi_h(s') = \sum_{s,a} P(s'|s,a)\,\pi^D_{h-1}(a|s)\, d^D_{h-1}(s)\,\widetilde\rho^\pi_{h-1}(s,a).
\]
The lemma statement follows from $\nabla\log\widetilde d^\pi_h(s') = \nabla\widetilde d^\pi_h(s')/\widetilde d^\pi_h(s')$ and the definition of $E^{D,\widetilde\rho}_{h-1}$ in Eq. (5). The gradient magnitude bound is proved in Lem. C.2.
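For intuition, here is a minimal numerical sketch of the smooth-clipping function $\sigma_\beta(x,c) = (x^{-\beta} + c^{-\beta})^{-1/\beta}$ analyzed in Prop. 4.3, together with its partial derivative in the first argument; the helper names are ours.

```python
# Smooth clipping: a soft version of min(x, c) that is differentiable everywhere.
def sigma(x, c, beta=4.0):
    """sigma_beta(x, c) = (x^{-beta} + c^{-beta})^{-1/beta}; -> min(x, c) as beta grows."""
    return (x ** (-beta) + c ** (-beta)) ** (-1.0 / beta)

def dsigma_dx(x, c, beta=4.0):
    """Partial derivative in the first argument, which lies in [0, 1]."""
    return (1.0 + (x / c) ** beta) ** (-1.0 / beta - 1.0)

# Example: the smooth clip lower-bounds the hard clip and differs from it by
# at most min(x, c)/beta (Prop. 4.3, Part 2).
x, c, beta = 0.8, 0.5, 4.0
print(min(x, c), sigma(x, c, beta), min(x, c) - sigma(x, c, beta) <= min(x, c) / beta)
```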
Lemma C.2 (Bounded gradient magnitude). Suppose $\sigma$ is differentiable almost everywhere. Then, under Part 1 of Asm. 4.1 and Asm. 2.1, $\|\nabla\log\widetilde d^\pi_h(s)\|_\infty \le hG$.

Proof of Lem. C.2. As a consequence of Asm. 4.1, which states that the gradient of $\sigma$ is non-increasing in its first argument, for any $x, c \ge 0$ we have
\[
\sigma(x,c) = \int_0^x \partial_x\sigma(z,c)\, dz \ge \int_0^x \partial_x\sigma(x,c)\, dz = x\,\partial_x\sigma(x,c).
\]
Substituting $x \leftarrow \widetilde d^\pi_{h-1}(s)$ and $c \leftarrow C^s_{h-1} d^D_{h-1}(s)$, the above shows that $\widetilde d^\pi_{h-1}(s)\,\partial_x\sigma(\cdot)/\sigma(\cdot) \le 1$, and hence
\[
\big\| \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \big\| \le \big\| \nabla\log\widetilde d^\pi_{h-1}(s) \big\|
\]
pointwise. Since Lem. 4.2 expresses $\nabla\log\widetilde d^\pi_h$ through a valid (conditional) expectation, for any $s'\in\mathcal S$ we have
\begin{align*}
\big\| \nabla\log\widetilde d^\pi_h(s') \big\|_\infty &\le \max_{s,a}\Big\{ \big\| \nabla\log\sigma\big( \pi(a|s), C^a_{h-1}\pi^D_{h-1}(a|s) \big) \big\|_\infty + \big\| \nabla\log\sigma\big( \widetilde d^\pi_{h-1}(s), C^s_{h-1}d^D_{h-1}(s) \big) \big\|_\infty \Big\} \\
&\le G + \max_s \big\| \nabla\log\widetilde d^\pi_{h-1}(s) \big\|_\infty \le hG,
\end{align*}
where we use Asm. 2.1 in the second line and unroll the same inequalities through the levels in the last line.

Proof of Prop. 4.4. First, we bound the difference between the smooth-clipped and (hard-)clipped density functions:
\[
\big\| \widetilde d^\pi_h - \bar d^\pi_h \big\|_1 \le \big\| \sigma\big( \widetilde d^\pi_{h-1}, C^s_{h-1}d^D_{h-1} \big) - \bar d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 + \max_s \big\| \widetilde\pi_{h-1}(\cdot|s) - \bar\pi_{h-1}(\cdot|s) \big\|_1.
\]
For the second term and any $s$,
\[
\big\| \widetilde\pi_{h-1}(\cdot|s) - \bar\pi_{h-1}(\cdot|s) \big\|_1 = \big\| \sigma\big( \pi_{h-1}(\cdot|s), C^a_{h-1}\pi^D_{h-1}(\cdot|s) \big) - \pi_{h-1}(\cdot|s)\wedge C^a_{h-1}\pi^D_{h-1}(\cdot|s) \big\|_1 \le D_\sigma \big\| \pi_{h-1}(\cdot|s)\wedge C^a_{h-1}\pi^D_{h-1}(\cdot|s) \big\|_1 \le D_\sigma.
\]
For the first term,
\begin{align*}
\big\| \sigma\big( \widetilde d^\pi_{h-1}, C^s_{h-1}d^D_{h-1} \big) - \bar d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1
&\le \big\| \sigma\big( \widetilde d^\pi_{h-1}, C^s_{h-1}d^D_{h-1} \big) - \widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 + \big\| \widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} - \bar d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 \\
&\le D_\sigma \big\| \widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1} \big\|_1 + \big\| \widetilde d^\pi_{h-1} - \bar d^\pi_{h-1} \big\|_1,
\end{align*}
where in the last line we use Asm. 4.1 to upper bound the first term, and the 1-Lipschitzness of the pointwise minimum to upper bound the second term. Since $\|\widetilde d^\pi_{h-1}\wedge C^s_{h-1}d^D_{h-1}\|_1 \le \|\widetilde d^\pi_{h-1}\|_1 \le \| d^\pi_{h-1}\|_1 = 1$, we have
\[
\big\| \widetilde d^\pi_h - \bar d^\pi_h \big\|_1 \le 2D_\sigma + \big\| \widetilde d^\pi_{h-1} - \bar d^\pi_{h-1} \big\|_1 \le 2hD_\sigma
\]
after rolling out the induction. Then for any $\pi$,
\[
\big| \bar J(\pi) - \widetilde J(\pi) \big| \le \sum_h \big\| \widetilde d^\pi_h - \bar d^\pi_h \big\|_1 \le 2H^2 D_\sigma.
\]
Lastly, let $\widetilde\pi^* = \operatorname{argmax}_{\pi\in\Pi_\Theta}\widetilde J(\pi)$, and define $\bar\pi^*$ similarly. Then
\[
\bar J(\bar\pi^*) - \widetilde J(\widetilde\pi^*) \le \bar J(\bar\pi^*) - \widetilde J(\bar\pi^*) \le 2H^2 D_\sigma.
\]

C.3 Proofs for Sec. 4.3

First, we give an example of $\mathcal G$ that satisfies Asm. 4.2 in low-rank MDPs.

Proposition C.1. Suppose $M$ is a low-rank MDP (Def. B.1), and suppose $\mu$ is known. For each layer $h$, define the function class
\[
\mathcal G_h = \Big\{ g_h = \frac{\mu^\top\Psi}{\mu^\top\psi} : \Psi\in\mathbb R^{k\times p},\ \psi\in\mathbb R^k,\ \|g_h\|_\infty \le hG \Big\}, \quad h\in[H].
\]
Then $\{\mathcal G_h\}$ satisfies Asm. 4.2 and has pseudo-dimension (Def. F.1) $d_{\mathcal G_h} = O(kp)$.

Proof of Prop. C.1. It suffices to show that, for any $f:\mathcal S\times\mathcal A\to\mathbb R^p$, reweighting function $\rho:\mathcal S\times\mathcal A\to\mathbb R_+$, and $h\in[H]$, the gradient update in Lem. 4.2 satisfies $E^{D,\rho}_h(f)\in\mathcal G_{h+1}$. Fix $\rho$, $f$, and $h$. From the definition of $E^{D,\rho}_h$, we have
\[
[E^{D,\rho}_h f](s') = \frac{\sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)\, f(s,a)}{\sum_{s,a} P(s'|s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)}.
\]
Then, since $P(s'|s,a) = \langle\phi(s,a), \mu(s')\rangle$, we can apply the same steps as in the proof of Prop. B.1 to show that there exist $\Psi\in\mathbb R^{k\times p}$ and $\psi\in\mathbb R^k$ such that $[E^{D,\rho}_h f](s') = \mu(s')^\top\Psi / \mu(s')^\top\psi$ for all $s'\in\mathcal S$. Specifically, $\psi = \sum_{s,a}\phi(s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)$, and the $p'$-th column of $\Psi$ is $\Psi_{p'} = \sum_{s,a}\phi(s,a)\,\pi^D_h(a|s)\, d^D_h(s)\,\rho(s,a)\, f_{p'}(s,a)$.

Proof of Thm. 4.1. For the remainder of this section, we define the constants $\varepsilon_w$ and $\varepsilon_{\mathrm{mle}}$ to be the estimation errors of $\widetilde w^\pi$ and $d^D$, respectively, such that for a given $\pi$ and any $h\in[H]$ we have
\[
\big\| \widehat w^\pi_h - \widetilde w^\pi_h \big\|_{1, d^{D,\prime}_{h-1}} \le \varepsilon_w, \qquad \big\| \widehat d^D_h - d^D_h \big\|_1 \le \varepsilon_{\mathrm{mle}}, \qquad \big\| \widehat d^{D,\prime}_h - d^{D,\prime}_h \big\|_1 \le \varepsilon_{\mathrm{mle}}.
\]
We can obtain such estimates using Alg. 4 and Alg. 5: a direct application of Lem. D.1 with a union bound gives $\varepsilon_{\mathrm{mle}} = O\big( \sqrt{\log(H|\mathcal F|/\delta)/n} \big)$, and Thm. E.1 gives the corresponding bound on $\varepsilon_w$ in terms of $\log(H|\mathcal W|/\delta)$. Next, recall that
\[
\widehat\nabla\widetilde J(\pi) = \sum_h \frac 1 n \sum_{(s,a,s')\in D^{\mathrm{grad}}_h} \widehat w^\pi_h(s')\, R_h(s')\, \widehat g^\pi_h(s').
\]
Its expected value over draws of $D^{\mathrm{grad}}$ is
\[
\mathbb E_{D^{\mathrm{grad}}}\big[ \widehat\nabla\widetilde J(\pi) \big] = \sum_h \mathbb E_{s'\sim d^{D,\prime}_{h-1}}\big[ \widehat w^\pi_h(s')\, R_h(s')\, \widehat g^\pi_h(s') \big].
\]
First we bound the statistical error from using samples to approximate $\nabla\widetilde J(\pi)$, given the gradient estimate.
Fix the other datasets, then b e J(π) EDgrad h b e J(π) i p max p [p] b Es d D, h 1 [ bwπ h(s )Rh(s )bgπ h(s )] Es d D, h 1 [ bwπ h(s )Rh(s )bgπ h(s )] h Cs h Ca h max p [p],h [H] b Es d D, h 1 [bgπ h(s )] Es d D, h 1 [bgπ h(s )] h Cs h Ca h where εstat = HG q log(8p H/δ) 2n is obtained by using Hoeffding s with δ = δ/4, since the randomness in bw and bg are fixed. Then for any p [p], EDgrad h h b e J(π) i e J(π) s d D, h 1(s )Rh(s ) bwπ h(s ) bgπ h(s ) wπ h(s ) log edπ h(s ) s d D, h 1(s )Rh(s )wπ h(s ) bgπ h(s ) log edπ h(s ) + s d D, h 1(s )Rh(s )byπ h(s ) ( ˆwπ h(s ) wπ h(s )) h G bwπ h wπ h 1,d D, h 1 + max p [p] [bgπ h]p log edπ h 1, e dπ h The first term is bounded by Thm. E.1. For the second term, we use the following decomposition, which is proved at the end of this section. Lemma C.3 (Gradient estimation error decomposition). Let εmle and εw be such that for all h [h] and π ΠΘ, we have bd D h d D h 1, bd D, h d D, h 1 εmle h and bwπ h ewπ h 1,d D, h 1 εw h . Then under Asm. 2.1 and Asm. 4.1, for any p [p], bgπ h from Alg. 2 satisfies bgπ,p h p log edπ h 1, e dπ h 6h Cs h 1Ca h 1LσG εmle h 1 (data distribution estimation error) + 3h LσG εw h 1 (occupancy estimation error) + bgπ,p h [ED,bρ h 1( log eπh 1 + bgh 1)]p 1, e dπ h (statistical regression error) + bgπ,p h 1 p log edπ h 1 1, e dπ h 1 (recursive term) From Lem. C.4, we have [ED,bρ h 1bgh 1]p bgπ,p h 1, e dπ h q 2 1 + Cs h 1εmle h 1 εreg h + 2h G 2Cs h 1εmle h 1 + εw h 1 where εreg h = O( q d GCs h 1Ca h 1h2G2 log(np H/δ) n ). Then plugging the above into the decomposition in Lem. C.3, we have bgp h p log edπ h 1, e dπ h 10Cs h 1Ca h 1h GLσ εmle h 1 + 5h GLσ εw h 1 2 1 + Cs h 1εmle h 1 εreg h + bgp h 1 p log edπ h 1 1, e dπ h 1 Unrolling through timesteps, we have bgp h p log edπ h 1, e dπ h 10h2GLσ X g 0 only if s Sh. Now define π+ such that for any s, eπ+( |s) = argmaxπ (A) D eπ, e Qπθ h (s, ) E . Then e J(πE) e J(πθ) sh,ah σ edπE h (sh), d D h (sh) eπE(ah|sh) eπθ(ah|sh) e Qπθ h (sh, ah) sh,ah σ edπE h (sh), d D h (sh) eπ+(ah|sh) eπθ(ah|sh) e Qπθ h (sh, ah) (20) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h sh,ah σ edπθ h (sh), d D h (sh) π+(ah|sh) πθ(ah|sh) 1 πθ(ah|sh), Ca hπD(ah|sh) e Qπθ h (sh, ah) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h sh,ah σ edπθ h (sh), d D h (sh) eπ+(ah|sh) eπθ(ah|sh) e Qπθ h (sh, ah) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h e Bπ+(πθ) e Bπθ(πθ) σ edπE h , Cs hd D h σ edπθ h , Cs hd D h e Bπθ (πθ) e Bπθ(πθ) Proof of Prop. C.2 Under all-policy coverage, we can apply the result in Lem. 3.2, noting that d 0 = 1 H d D h , and dπ h Cs hd D h . P(X|S, a) = ϵ Figure 3: Example in Prop. C.3 Proposition C.3 (Vanishing gradient from clipping with exploratory data). Consider the MDP in App. C.6, and the data distribution where d D(X) = 1/2 and d D(Y ) = ϵ and d D(Z) = (1 ϵ)/2 for some ϵ [0, 1]. For any C, we have all-policy coverage, i.e., dπ h Chd D h for all h and all policies π. Let π be the stationary (and in this case, optimal) policy of running Alg. 2 with D described in Prop. C.2. Then J(π ) J(π) = (1 ϵ) (1 2CY ϵ) . If ϵ is exponentially small, J(π ) J(π) = O(1) unless CY is exponentially large. Proof. The example boils down to a simple bandit problem of choosing either L or R in state X. π(L|X) = CY d D(Y ) d D(X) = 2CY ϵ, and π(R|X) = 1 π(L|X). In comparison, π (L|X) = 1. Then e J(π) = J(π) = CY d D(Y ) d D(X) + ϵ(1 π(L|X)). 
In comparison, J(π ) = 1, so J(π ) J(π) = (1 ϵ) (1 2CY ϵ) For reasonable choices of CZ (say, 2 or 3), CY must be proportional to ϵ 1 for the suboptimality gap to shrink, and in particular if ϵ is exponentially small then CY must be exponentially large, which blows up the RHS of the bound. Proof of Cor. C.2 The first step follows the proof of Cor. 3.2. Combining Thm. 4.1 with Lem. G.5 and plugging in above, we have t e J(π ) e J(π(t)) h Cs h Ca h)2 L2σ log(N (ε, G) N D (ε, ΠΘ)|F||W|) n Then we also have t J(π ) J(π(t)) t e J(π ) e J(π(t)) + 2H2Dσ (Prop. 4.4) 2H2Dσ + Bm CC h Cs h Ca h)2 L2σ log(N (ε, G) N D (ε, ΠΘ)|F||W|) n Additional results Helper lemmas are stated and proved below. Lemma C.7. Suppose σ satisfies Parts 1 and 2 of Asm. 4.1. Then for e J(π) = P s edπ h(s)Rh(s), s,a σ edπ h(s), Cs hd D h (s) eπh(a|s) e Qπ h(s, a), e Qπ h(s, a) = X s Ph(s |s, a) Rh+1(s ) + X a eπh+1(a |s ) xσ edπ h+1(s ), Cs h+1d D h+1(s ) e Qπ h+1(s , a ) Proof of Lem. C.7. For notational clarity we omit Cs h below. Expanding edπ, we have edπ h(sh) = X sh 1,ah 1 P(sh|sh 1, ah 1) eπh 1(ah 1|sh 1)σ edπ h 1(sh 1), d D h 1(sh 1) + eπh 1(ah 1|sh 1) xσ edπ h 1(sh 1), d D h 1(sh 1) edπ h 1(sh 1) (sh 1,ah 1,...,sg,ag) t=g+1 P(st+1|st, at)eπt(at|st) xσ edπ t (st), d D t (st) # P(sg+1|sg, ag)σ edπ g (sg), d D g (sg) eπg(ag|sg) For short, define e Peπ(sh|sg, ag) := X (sh 1,ah 1,...,sg+1,ag+1) t=g+1 P(st+1|st, at)eπt(at|st) xσ edπ t (st), d D t (st) # P(sg+1|sg, ag) observing if eπh = πh and xσ edπ h, d D h = 1 for all h, we have e Peπ(sh|sg, ag) = Pπ(sh|sg, ag), the standard transition kernel from (sg, ag) sh. This occurs, for example, when σs is hard clipping and π is fully covered by data. Then using the above definition, we have edπ h(sh) = X sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg)e Peπ(sh|sg, ag). (21) Plugging this expression into J(π), we obtain sh edπ h(sh)R(sh) sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg)e Peπ(sh|sg, ag) sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg) e Peπ(sh|sg, ag)R(sh) sg,ag σ edπ g (sg), d D g (sg) eπg(ag|sg) e Qπ(sg, ag) where we have defined e Qπ g (sg, ag) := e Peπ(sh|sg, ag)R(sh) sg+1 P(sg+1|sg, ag) R(sg+1) + X ag+1 eπg+1(ag+1|sg+1) xσ edπ g+1(sg+1), d D g+1(sg+1) e Qeπ g+1(sg+1, ag+1) Lemma C.8. If σ is concave in its first argument, for any π and π we have e J(π ) e J(π) s,a σ edπ h (s), d D h (s) (eπ h(a|s) eπh(a|s)) e Qπ h(s, a), where e Qπ h is defined in Lem. C.7. Proof. This statement follows straightforwardly from plugging in Lem. C.9 and rearranging, similar to the proof of Lem. C.7. Lemma C.9. If σ is concave in its their first arguments, then for any h and π, π edπ h (s ) edπ h(s ) s,a σ edπ g (s), d D g (s) σ π g(a|s), πD g (a|s) σ πg(a|s), πD g (a|s) e Pπ(sh = s |sg = s, ag = a), e Pπ(sh|sg, ag) := X sh 1:g+1,ah 1:g+1 t=g+1 P(st+1|st, at)eπt(at|st) xσ edπ t (st), d D t (st) # P(sg+1|sg, ag). Proof of Lem. C.9. Define πg = {π 1, . . . , π g 1, πg, . . . , πH 1}, i.e., a policy that starts playing π at timestep g, and plays π for the timesteps before that. edπ h (s ) edπ h(s ) = edπ h (s ) edπh 1 h (s ) + edπh 1 h (s ) edπ h(s ) For the first pair of terms, π and πh 1 only differ the policy used to take the action ah 1 (and both play π before that), thus dπ h 1 = dπh 1 h 1 and edπ h (s ) edπh 1 h (s ) s,a P(s |s, a) σ π h 1(a|s), πD h 1(a|s) σ πh 1(a|s), πD h 1(a|s) σ edπ h 1(s), d D h 1(s) For the second pair of terms, πh 1 and π both play π at time h 1, but the former uses π for timesteps 1, . . . 
, h 2: edπh 1 h (s ) edπ h(s ) s,a P(s |s, a)σ πh 1(a|s), πD h 1(a|s) σ edπh 1 h 1 (s), d D h 1(s) σ edπ h 1(s), d D h 1(s) s,a P(s |s, a)σ πh 1(a|s), πD h 1(a|s) σ edπ h 1(s), d D h 1(s) σ edπ h 1(s), d D h 1(s) s,a P(s |s, a)σ πh 1(a|s), πD h 1(a|s) xσ edπ h 1(s), d D h 1(s) edπ h 1(s) edπ h 1(s) where the last inequality above uses the concavity of σs in the first argument (recall concave functions f satisfy f(y) f(x) + f (x)(y x)). Combining the above two inequalities, we have the recursive relationship edπ h (s ) edπ h(s ) s,a P(s |s, a) σ π h 1(a|s), πD h 1(a|s) σ πh 1(a|s), πD h 1(a|s) σ edπ h 1(s), d D h 1(s) + σ πh 1(a|s), πD h 1(a|s) xσ edπ h 1(s), d D h 1(s) edπ h 1(s) edπ h 1(s) ! Unrolling through timesteps gives the lemma statement. Lemma C.10. Let Ca = maxh Ca h. Suppose ΠΘ is the direct policy parameterization, i.e., πθ(a|s) = θs,a, and σ is such that Dσ Ca for all h. Then for any γ (0, 1], in Def. C.1 we have N D (γ, ΠΘ) (Ca/γ)SAH. Proof of Lem. C.10. Typical gridding-style arguments discretize the range of π(a|s) for each (s, a). Since we are concerned with creating a cover for the policy ratio, however, a naive argument will incur 1/ mins,a πD(a|s) in the grid s cardinality. Our solution is to grid ΠΘ adaptively according to the magnitude of πD(a|s). Intuitively, we only need to grid up to the threshold For each (h, s, a), define the adaptive gridding scale to be γ hsa = γπD h (a|s). For any π ΠΘ, set its cover π as follows. k , if π(a|s) Ca hπD h (a|s), Ca hπD h (a|s), otherwise. Let ΠΘ = {π : π ΠΘ}. Then |ΠΘ| (maxh Ca h/γ)HSA. Further, π(a|s) Ca hπD h (a|s) πh(a|s) Ca hπD h (a|s) γπD h (a|s), thus (π Ca hπD h ) (πh Ca hπD h ) πD h γ, and applying Lem. C.11 gives the result. Lemma C.11. Suppose ΠΘ satisfies Def. C.1 with σxc = (x c). Then for any π ΠΘ, let π ΠΘ be its cover. Under Asm. 4.1, we have σ(π,CπD) πD σ(π,CπD) Proof of Lem. C.11. If π(a|x) CπD(a|x), using the 1-Lipschitzness of σ we have |σ π(a|s), CπD(a|s) σ π(a|s), CπD(a|s) | = |σ π(a|s) CπD(a|s) , CπD(a|s) σ π(a|s), CπD(a|s) | | π(a|s) CπD(a|s) π(a|s)| Algorithm 4 Maximum Likelihood Estimation Input: datasets {Dh}, function class F 1: for h = 1, . . . , H do 2: Estimate marginal data distributions bd D h 1 and bd D, h 1 by MLE on dataset Dh 1 bd D h 1 = argmax dh 1 Fh 1 (s, , ) Dh 1 log (dh 1(s)) (22) bd D, h 1 = argmax dh Fh ( , ,s ) Dh 1 log (dh(s )) . 3: end for Output: estimated data distributions {bd D h }h [H] and {bd D, h }h [H] If π(a|x) > CπD(a|x), |σ π(a|s), CπD(a|s) σ π(a|s), CπD(a|s) | CπD(a|s) σ π(a|s), CπD(a|s) CπD(a|s) (1 Dσ) π(a|s) CπD(a|s) C(Dσ + γ)πD, using Asm. 4.1 in the second inequality. As a result, σ(π(a|s),CπD(a|s)) πD(a|s) σ(π(a|s),CπD(a|s)) D Maximum Likelihood Estimation Algorithm 4 displays the data distribution estimation procedure used in offline gradient estimation (Algorithm 2), which is a direct application of MLE. The general formulation of the MLE problem utilized in this paper is to estimate a probability distribution over the instance space S. Given an i.i.d. sampled dataset D = {s(i)}n i=1 and a function class F, we optimize the MLE objective of the form bf = argmin f F s D log (f(s)) . (23) We assume F is finite, and refer readers to [LNSJ23; HCJ23] for techniques for handling infinite function classes. The general MLE guarantee is stated below, and is a well-established result (for example, a proof can be found in Appendix E of [AKKS20]). Lemma D.1 (MLE guarantee). Let D = {s(i)}n i=1 be a dataset, where s(i) are drawn i.i.d. 
from some fixed probability distribution f over S. Consider a function class F that satisfies: (i) f F, and (ii) each function f F is a valid probability distribution over S (i.e., f (S)) Then with probability at least 1 δ, bf from Eq. (23) has ℓ1 error guarantee 2 log(|F|/δ) The formal guarantee of Algorithm 4 is stated below, which is a straightforward application of Lemma D.1 with union bound (over all functions in F, and over all timesteps). Assumption D.1 (MLE Realizability). Suppose that h [h], we have d D h , d D, h 1 Fh for D defined in Def. 4.1. Additionally, f (S) is a valid distribution for all f Fh. Algorithm 5 Fitted Occupancy Iteration with Smooth Clipping Input: policy π, datasets {Dh}, function class W, clipping thresholds {Cs h, Ca h}, data estimates {bd D h } and {bd D, h }. 1: Initialize bdπ 0 = bd D 0 . 2: for h = 1, . . . , H do 3: Define LDh 1(wh, wh 1, eπh 1) := 1 |Dh 1| P (s,a,s ) Dh 1 wh(s ) wh 1(s) eπh 1(a|s) πD h 1(a|s) 2 , and estimate bwπ h = argmin wh Wh LDh 1 wh, σ( b dπ h 1,Cs h 1 b d D h 1) b d D h 1 , σ πh 1, Ca h 1πD h 1 , (24) 4: Set the estimate bdπ h = bwπ h bd D, h 1. 5: end for Output: estimated state occupancies { bwπ h}h [H]. Lemma D.2. Suppose {Fh} satisfies Asm. D.1. Then with probability at least 1 δ, for all h [H] the outputs of Algorithm 4 satisfy bd D h d D h 1 εmle and bd D, h d D, h 1 εmle, where 2 log(2H|F|/δ) E Offline Density Estimation The algorithm for offline density estimation is displayed in Alg. 5, and is directly copied from Algorithm 1 of [HCJ23], but with two minor modifications. The first is that the densities are clipped using a function σ, that can take clipping as a special case. The second is that it outputs the learned weights instead of the learned densities. The weight function class completeness assumption is shown Asm. E.1, and is satisfied in low-rank MDPs using linear-over-linear function classes that have pseudo-dimension bounded by MDP rank. It can be seen as a 1-dimensional version of Asm. 4.2 where ρ = 1 and in that sense strictly weaker. Assumption E.1 (Weight function completeness). For any π ΠΘ and h [H], we have σ(w f ,Cs h 1f) f eπh 1 πD h 1 Wh, w Wh 1, f, f Fh 1, Theorem E.1. Suppose σ satisfies Asm. 4.1 and W satisfies Asm. E.1. Let {bd D h }g and {bd D, h } be such that h [H], bd D h d D h 1 εmle and bd D, h d D, h 1 εmle. Then with probability at least 1 δ, the outputs { bwπ h} of Algorithm 5 satisfy for all h [H] bwπ h ewπ h 1,d D, h 1 g