# Recurrent Natural Policy Gradient for POMDPs

Published in Transactions on Machine Learning Research (10/2025)

Semih Cayci (cayci@mathc.rwth-aachen.de), Department of Mathematics, RWTH Aachen University

Atilla Eryilmaz (eryilmaz.2@osu.edu), Department of Electrical and Computer Engineering, The Ohio State University

Reviewed on OpenReview: https://openreview.net/forum?id=6G01e0vgIf

Abstract

Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs, while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on the sample and iteration complexity required to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining the limitations of RNN-based policy optimization for POMDPs.

1 Introduction

Reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) has been a particularly challenging problem due to the absence of an optimal stationary policy, which leads to a curse of dimensionality as the space of non-stationary policies grows exponentially over time (Krishnamurthy, 2016; Murphy, 2000).
To address this curse of dimensionality in solving POMDPs, finite-memory (Yu & Bertsekas, 2008; Yu, 2012; Kara & Yüksel, 2023; Cayci et al., 2024a) and RNN-based (Lin & Mitchell, 1993; Whitehead & Lin, 1995; Wierstra et al., 2010; Mnih et al., 2014; Ni et al., 2021; Lu et al., 2024) model-free RL approaches are widely used. Despite the empirical success of RNN-based model-free RL methods, a rigorous theoretical understanding of their performance in the POMDP setting remains limited. We begin by outlining two key observations that motivate our approach.

Observation 1. Recurrent neural networks (RNNs) have been extensively employed in model-free reinforcement learning (RL) to solve partially observable Markov decision processes (POMDPs) (Whitehead & Lin, 1995; Wierstra et al., 2010; Mnih et al., 2014). Recent work by Ni et al. (2021) demonstrates that RNN-based model-free RL can perform competitively with more sophisticated and structured approaches under appropriate hyperparameter and architecture choices. Lu et al. (2024) demonstrated shortcomings of emerging transformers in solving POMDPs, and showed, somewhat surprisingly, that particular recurrent architectures can achieve superior practical performance in certain scenarios. However, despite this plethora of works demonstrating the effectiveness of RNN-based model-free algorithms for solving POMDPs, a concrete theoretical understanding of these methods is still in a nascent stage. This is particularly important since, as noted by Ni et al. (2021), RNN-based model-free RL algorithms are sensitive to optimization parameters, and identifying provably good choices is important for practice.

Observation 2.
The natural policy gradient (NPG) framework has been shown to be effective in solving MDPs due to its versatility in encompassing powerful function approximators, such as deep neural networks (Wang et al., 2019; Cayci et al., 2024b). However, a naïve application of such non-recurrent model-free RL algorithms to POMDPs has been observed to be ineffective (Ni et al., 2021), which necessitates the careful incorporation of recurrent architectures into the policy optimization framework. This calls for incorporating and analyzing policy optimization, particularly the NPG framework, augmented with recurrent architectures, to obtain a provably effective solution for POMDPs. Our study is motivated by these observations and guided by the following key questions, each addressed in this work:

Q1. How can we achieve (i) provably effective and (ii) computation/memory-efficient policy evaluation for non-stationary policies in partially observable environments?

A temporal difference (TD) learning algorithm with an Ind RNN (Rec-TD) overcomes the so-called perceptual aliasing problem inherent in memoryless TD learning for POMDPs (Singh et al., 1994), and achieves near-optimal policy evaluation, provided a sufficiently large network (Theorem 5.4 and Remark 5.5). Our analysis identifies the exploding semi-gradients pathology in policy evaluation, which can significantly increase the network and iteration complexities needed to mitigate perceptual aliasing under long-term dependencies (Remark 5.6), and demonstrates the role of regularization in mitigating this. We also provide empirical results on random-POMDP instances in Appendix C.

Q2. How can we parameterize non-stationary policies by a rich and practically feasible class of RNNs and perform efficient policy optimization?
We represent non-stationary policies using Ind RNNs with softmax parameterization as a form of finite-state controller, and perform computationally efficient NPG updates (based on path-based compatible function approximation for POMDPs) for policy optimization. The policy optimization update (called Rec-NPG) is aided by Rec-TD as the critic (Section 4).

Q3. What are the memory, computation and sample complexities of the resulting Rec-NAC method, which employs Rec-NPG for policy updates and Rec-TD for policy evaluation?

Our non-asymptotic analyses of Rec-TD (Theorem 5.4) and Rec-NPG (Theorem 6.3) demonstrate their near-optimality in the large-network limit while highlighting dependencies on memory, long-term POMDP dynamics, and RNN smoothness. Pathological cases with long-term dependencies may require exponentially growing resources (Remarks 5.6-6.4). These results establish principled and scalable RL solutions for POMDPs, offering insights into the interplay between memory, smoothness, and optimization complexity.

1.1 Previous work

The natural policy gradient method, proposed by Kakade (2001), has been extensively investigated for MDPs (Agarwal et al., 2020; Cen et al., 2020; Khodadadian et al., 2021; Liu et al., 2020; Cayci et al., 2024c), and analyses of NPG with feedforward neural networks (FNNs) have been established by Wang et al. (2019); Liu et al. (2019); Cayci et al. (2024b). As these works consider MDPs, the policies are stationary. In our case, the joint analysis of RNNs and POMDPs constitutes a significant additional challenge. Standard TD learning, which does not have a memory structure, was shown to be suboptimal for POMDPs (Singh et al., 1994). In this work, we incorporate RNNs into TD learning as a form of memory to address this problem. In Yu (2012); Singh et al. (1994); Uehara et al. (2022); Kara & Yüksel (2023); Cayci et al. (2024a), finite-memory policies based on sliding-window approximations of the history were investigated.
Bilinear frameworks with memory-based policies (Uehara et al., 2022) and Hilbert space embeddings with deterministic latent dynamics (Uehara et al., 2023) enable sample-efficient learning under specific model structures. In Guo et al. (2022), an offline RL algorithm for the specific class of linear POMDPs was proposed. Unlike these existing works, our approach integrates RNNs with NAC methods, providing a scalable and theoretically grounded framework for general POMDPs without requiring structural assumptions such as deterministic transitions, fixed memory windows, or linear POMDP dynamics. Value- and policy-based model-free RL algorithms based on RNNs have been widely considered in practice to solve POMDPs (Lin & Mitchell, 1993; Whitehead & Lin, 1995; Wierstra et al., 2010; Mnih et al., 2014; Ni et al., 2021; Lu et al., 2024). However, these works are predominantly experimental; to the best of our knowledge, there is no theoretical analysis of RNN-based RL methods for POMDPs. In this work, we also present theoretical guarantees for RNN-based NPG for POMDPs. For structural results on the hardness of RL for POMDPs, we refer to (Liu et al., 2022; Singh et al., 1994).

1.2 Notation

For a finite set $A$, $\Delta(A) = \{v \in \mathbb{R}^{|A|}_{\geq 0} : \sum_{a \in A} v_a = 1\}$ is the set of probability vectors over the set $A$. $\mathrm{Rad}(\alpha) = \mathrm{Unif}\{-\alpha, \alpha\}$ for $\alpha \in \mathbb{R}_+$.

2 Preliminaries on Partially Observable Markov Decision Processes

In this paper, we consider a discrete-time infinite-horizon partially observable Markov decision process (POMDP) with the (nonlinear) dynamics

$$\mathbb{P}(S_{t+1} = s \mid S_k, A_k, k \leq t) =: P((S_t, A_t), s), \qquad \mathbb{P}(Y_t = y \mid S_t) =: \phi(S_t, y),$$

for any $s \in S$ and $y \in Y$, where $S_t$ is an $S$-valued state, $Y_t$ is a $Y$-valued observation, and $A_t$ is an $A$-valued control process, with stochastic kernels $P : S \times A \times S \to [0, 1]$ and $\phi : S \times Y \to [0, 1]$.
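To make the dynamics concrete, the following is a minimal sketch of sampling a trajectory from a tabular POMDP with the kernels $P$ and $\phi$ above; all names and the toy kernels are hypothetical, and for simplicity we draw the initial state directly rather than through an initial observation distribution.

```python
import numpy as np

def rollout(P, phi, policy, mu, T, rng):
    """Sample one length-T trajectory from a tabular POMDP.

    P:   (S, A, S') array, P[s, a, s'] = state transition kernel
    phi: (S, Y) array, phi[s, y]      = observation kernel
    policy(history) -> distribution over actions given (y0, a0, ..., yt)
    mu:  (S,) initial state distribution (a simplification for this sketch)
    """
    S = rng.choice(len(mu), p=mu)
    history, states = [], []
    for t in range(T):
        Y = rng.choice(phi.shape[1], p=phi[S])          # observe Y_t ~ phi(S_t, .)
        history.append(("obs", Y))
        A = rng.choice(P.shape[1], p=policy(history))   # act on the history Z_t
        history.append(("act", A))
        states.append((S, A))
        S = rng.choice(P.shape[2], p=P[S, A])           # S_{t+1} ~ P((S_t, A_t), .)
    return history, states

# toy 2-state / 2-observation / 2-action POMDP with a memoryless uniform policy
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
phi = np.array([[0.8, 0.2], [0.3, 0.7]])
uniform = lambda h: np.array([0.5, 0.5])
hist, sa = rollout(P, phi, uniform, np.array([1.0, 0.0]), T=5, rng=rng)
```

The `policy` callable receives the full observation-action history, matching the admissible-policy definition below; a memoryless policy simply ignores it.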
We consider finite but arbitrarily large $A$, $Y$ and $S$, where $A \subset \mathbb{R}^{d_1}$ and $Y \subset \mathbb{R}^{d_2}$ for some $d_1, d_2 \in \mathbb{Z}_+$ with $d := d_1 + d_2$, and $\|(y, a)\|_2 \leq 1$ for any $(y, a) \in Y \times A$. In this setting, the state process $(S_t)_{t \in \mathbb{N}}$ is not observable by the controller. Let

$$Z_t = \begin{cases} Y_0, & t = 0, \\ (Z_{t-1}, A_{t-1}, Y_t), & t > 0, \end{cases} \qquad (1)$$

be the history process, which is available to the controller at time $t \in \mathbb{N}$, and

$$\bar{Z}_t := (Z_t, A_t) = (Y_0, A_0, \ldots, Y_t, A_t), \qquad (2)$$

be the history-action process.

Definition 2.1 (Admissible policy). An admissible control policy $\pi = (\pi_t)_{t \in \mathbb{N}}$ is a sequence of measurable mappings $\pi_t : (Y \times A)^t \times Y \to \Delta(A)$, and the control at time $t$ is chosen under $\pi_t$ randomly as $\mathbb{P}(A_t = a \mid Z_t = z_t) = \pi_t(a \mid z_t)$ for any $z_t \in (Y \times A)^t \times Y$. We denote the class of all admissible policies by $\Pi_{NM}$.

If an action $a$ is taken at state $s$, then a deterministic reward $r(s, a)$ with $|r(s, a)| \leq \bar{r} < \infty$ is obtained.

Definition 2.2 (Value function, Q-function, advantage function). Let $\pi$ be an admissible policy, and $\mu \in \Delta(Y)$. The value function under $\pi$ with discount factor $\gamma \in (0, 1)$ is defined as

$$V^\pi_t(z_t) := \mathbb{E}^\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r(S_k, A_k) \,\Big|\, Z_t = z_t\Big], \qquad (3)$$

for any $z_t \in (Y \times A)^t \times Y$. Similarly, the state-action value function (also known as the Q-function) and the advantage function under $\pi$ are defined as

$$Q^\pi_t(\bar{z}_t) := \mathbb{E}^\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r(S_k, A_k) \,\Big|\, \bar{Z}_t = \bar{z}_t\Big], \qquad A^\pi_t(z_t, a) := Q^\pi_t(z_t, a) - V^\pi_t(z_t),$$

for any $\bar{z}_t \in (Y \times A)^{t+1}$, respectively. Given an initial observation distribution $\mu \in \Delta(Y)$, the optimization problem is

$$\max_{\pi \in \Pi_{NM}} V^\pi(\mu), \qquad (5)$$

where $V^\pi(\mu) := \sum_{y_0 \in Y} V^\pi_0(y_0)\,\mu(y_0)$. We denote an optimal policy as $\pi^\star \in \arg\max_{\pi \in \Pi_{NM}} V^\pi(\mu)$.

Remark 2.3 (Curse of history in RL for POMDPs). Note that the problem in equation 5 is significantly more challenging than its subcase of (fully-observable) MDPs, since there may not exist an optimal stationary policy (Krishnamurthy, 2016; Singh et al., 1994). As such, the policy search is over non-stationary randomized policies of the form $\pi = (\pi_0, \pi_1, \ldots)$
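The discounted sums defining $V^\pi_t$ and $Q^\pi_t$ can be estimated from sampled reward trajectories; a minimal sketch (names are ours), with a sanity check against the closed form $\bar r/(1-\gamma)$ for a constant reward:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """One Monte Carlo sample of the discounted return sum_k gamma^k r_k,
    i.e. a single-trajectory estimate of the expectation in equation (3)."""
    return float(sum(g * r for g, r in
                     zip(gamma ** np.arange(len(rewards)), rewards)))

# with constant reward 1 and a long horizon, the value approaches 1/(1-gamma)
v_true = 1.0 / (1.0 - 0.9)
v_trunc = discounted_return(np.ones(200), 0.9)
```

Truncating the sum at horizon $T$ incurs an error of order $\gamma^T/(1-\gamma)$, the same truncation effect that appears in the finite-time bounds later in the paper.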
where $\pi_t : (Y \times A)^t \times Y \to \Delta(A)$ depends on the history of observations $Z_t = (Y_0, A_0, Y_1, \ldots, A_{t-1}, Y_t)$ for $t \in \mathbb{N}$. In this case, direct extensions of existing reinforcement learning methods for MDPs become intractable, even for finite $Y, A$: the memory complexity of a non-stationary policy $\pi \in \Pi_{NM}$ at epoch $t \in \mathbb{N}$ is $O(|Y \times A|^{t+1})$, growing exponentially. In the following section, we formally introduce the RNN architecture that we study in this paper.

3 Independently Recurrent Neural Network Architecture

We consider an independently recurrent neural network (Ind RNN) architecture in this work (Li et al., 2018; 2019). This architecture has been featured in POPGym (Morad et al., 2023), as it enables RNNs with large sequence lengths by handling long dependencies in practical applications. In other works, it has been shown to be effective for POMDPs in practice as well (Lu et al., 2024; Elelimy et al., 2024). Let $X_t = (Y_t, A_t) \in \mathbb{R}^d$, so that $\bar{Z}_t = (X_0, X_1, \ldots, X_t)$ for any $t \in \mathbb{Z}_+$ by equation 2. The central structure in an Ind RNN is the sequence of hidden states $H_t = (H^{(1)}_t, H^{(2)}_t, \ldots, H^{(m)}_t) \in \mathbb{R}^m$ for $t = 0, 1, \ldots$, which evolves according to

$$H^{(i)}_t(\bar{Z}_t; W, U) = \varrho\big(W_{ii} H^{(i)}_{t-1}(\bar{Z}_{t-1}; W, U) + \langle U_i, X_t \rangle\big) \quad \text{for all } i \in [m], \qquad (6)$$

with the initial condition $H^{(i)}_0(\bar{Z}_0; W, U) := \varrho(\langle U_i, X_0 \rangle)$, where $\varrho : \mathbb{R} \to \mathbb{R}$ is a smooth activation function, $W = \mathrm{diag}(W_{11}, W_{22}, \ldots, W_{mm})$, and $U$ is an $m \times d$ matrix whose $i$-th row is $U_i^\top$ for $i \in [m]$. We assume a smooth activation function $\varrho$ with $|\varrho(z)| \leq \varrho_0$, $|\varrho'(z)| \leq \varrho_1$ and $|\varrho''(z)| \leq \varrho_2$ for all $z \in \mathbb{R}$, which is satisfied by many widely-used activation functions, including tanh and the sigmoid function. We consider a linear readout layer with weights $c \in \mathbb{R}^m$, which leads to the output

$$F_t(\bar{Z}_t; W, U, c) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} c_i H^{(i)}_t(\bar{Z}_t; W, U). \qquad (7)$$

The operation of an independently recurrent neural network is illustrated in Figure 1.
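A minimal sketch of the forward pass in equations (6)-(7), with tanh as the smooth activation $\varrho$; array names are ours, and the $1/\sqrt{m}$ readout scaling is the standard NTK convention assumed here:

```python
import numpy as np

def indrnn_forward(X, W, U, c, act=np.tanh):
    """Output sequence F_0, ..., F_{T-1} of an Ind RNN (diagonal recurrence).

    X: (T, d) input sequence, x_t = (y_t, a_t)
    W: (m,) diagonal hidden-to-hidden weights W_ii
    U: (m, d) input-to-hidden weights
    c: (m,) frozen linear readout
    """
    m = W.shape[0]
    H = np.zeros(m)                    # H_{-1} = 0 recovers the initial condition
    outputs = []
    for x in X:
        H = act(W * H + U @ x)         # independent (element-wise) recurrence, eq. (6)
        outputs.append(c @ H / np.sqrt(m))   # linear readout with 1/sqrt(m) scaling
    return np.array(outputs)

rng = np.random.default_rng(1)
m, d, T = 64, 3, 5
F = indrnn_forward(rng.normal(size=(T, d)),
                   rng.choice([-0.5, 0.5], size=m),
                   rng.normal(size=(m, d)),
                   rng.choice([-1.0, 1.0], size=m))
```

The diagonal weight vector `W` is exactly what distinguishes the Ind RNN from a fully-connected RNN: each hidden unit only sees its own previous value.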
Following the neural tangent kernel literature, we omit the task of training the linear output layer $c \in \mathbb{R}^m$ for simplicity, and study the training dynamics of $(W, U)$, which is the main challenge (Du et al., 2018; Oymak & Soltanolkotabi, 2020; Cai et al., 2019; Wang et al., 2019).

Figure 1: An independently recurrent neural network (Ind RNN) in the RL context.

Consequently, we denote the learnable parameters of an Ind RNN compactly in vector form as

$$\Theta = \begin{bmatrix} \Theta_1 \\ \Theta_2 \\ \vdots \\ \Theta_m \end{bmatrix} \in \mathbb{R}^{m(d+1)}, \quad \text{where } \Theta_i = \begin{bmatrix} W_{ii} \\ U_i \end{bmatrix} \in \mathbb{R}^{d+1} \text{ for } i \in [m]. \qquad (8)$$

We use $\Theta$ and $(W, U)$ interchangeably throughout the paper. A key feature of the neural tangent kernel analysis is the random initialization (Bai & Lee, 2019; Chizat et al., 2019; Cayci et al., 2023).

Definition 3.1 (Symmetric random initialization). Let $(c^0, \Theta^0) = (c^0_i, \Theta^0_i)_{i \in [m]}$ be a random vector such that

$$c^0_i \overset{iid}{\sim} \mathrm{Rad}(1), \qquad \Theta^0_i := \begin{bmatrix} W^0_{ii} \\ U^0_i \end{bmatrix} \overset{iid}{\sim} \mathrm{Rad}(\alpha) \otimes N(0, I_d), \qquad c^0_{i+m/2} = -c^0_i, \quad \Theta^0_{i+m/2} = \Theta^0_i,$$

for $i = 1, 2, \ldots, \frac{m}{2}$. We call $(c^0, \Theta^0)$ a symmetric random initialization, and denote the distribution of $(c^0, \Theta^0)$ as $\zeta_0$.

For both policy optimization (Algorithm 1) and policy evaluation (Algorithm 2), the Ind RNNs are randomly initialized according to Definition 3.1. Such random initialization schemes are widely adopted in practice, and play a fundamental role in the theoretical analysis of deep learning algorithms (Bai & Lee, 2019; Chizat et al., 2019; Wang et al., 2019; Cai et al., 2019; Liu et al., 2019). In the following subsection, we define the reference function class determined by overparameterized Ind RNNs in a detailed way, which will be instrumental in the theoretical results and their analyses. We note that this subsection can be skipped by readers who would like to focus on the algorithmic design.
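The symmetric initialization of Definition 3.1 can be sketched as follows (names are ours); the check at the end illustrates why the mirrored construction gives zero network output at initialization, since each unit in the second half exactly cancels its twin in the first half.

```python
import numpy as np

def symmetric_init(m, d, alpha, rng):
    """Symmetric random initialization (Definition 3.1): the second half of
    the units copies the first half's weights with negated readout."""
    assert m % 2 == 0
    half = m // 2
    c = rng.choice([-1.0, 1.0], size=half)        # c_i ~ Rad(1)
    W = rng.choice([-alpha, alpha], size=half)    # W_ii ~ Rad(alpha)
    U = rng.normal(size=(half, d))                # U_i ~ N(0, I_d)
    return (np.concatenate([c, -c]),              # c_{i+m/2} = -c_i
            np.concatenate([W, W]),               # Theta_{i+m/2} = Theta_i
            np.vstack([U, U]))

c0, W0, U0 = symmetric_init(8, 3, alpha=0.5, rng=np.random.default_rng(2))
# any input produces (near-machine-precision) zero output at initialization
x = np.ones(3)
H = np.tanh(U0 @ x)
out = c0 @ H / np.sqrt(8)
```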
3.1 Reference Function Class for Independently Recurrent Neural Networks

A fundamental question in reinforcement learning with function approximation is to determine a concrete reference function class for the function approximation architecture that is used in the value and policy spaces (Bertsekas & Tsitsiklis, 1996). In this subsection, we identify and discuss the reference function class defined by the Ind RNN architecture that will be used for incorporating memory to solve POMDPs. To motivate the discussion, we first review basic reference function classes for (fully-observable) MDPs, then extend the discussion to POMDPs.

Function approximation in MDPs. Let us consider value-based reinforcement learning in the case of MDPs, where the objective is to learn the Q-function under a given stationary policy $\pi$. The approximation error for a given reference class $F$ of functions $f : S \times A \to \mathbb{R}$ is

$$\epsilon_{app}(F) := \inf_{f \in F} \mathbb{E}_{s,a}\big[(Q^\pi(s, a) - f(s, a))^2\big]. \qquad (9)$$

For example, if a linear function approximation scheme with a given feature map $\varphi : S \times A \to \mathbb{R}^p$ is used, then the reference function class is $F := \{(s, a) \mapsto \theta^\top \varphi(s, a) : \theta \in \mathbb{R}^p\} = \mathrm{span}(\Phi)$, where $\Phi := [\varphi^\top(s, a)]_{s,a}$ is the feature matrix. In the case of linear MDPs (Jin et al., 2020), we have $Q^\pi \in F$ and $\epsilon_{app}(F) = 0$; otherwise, TD(0) with this linear approximation scheme has an inevitable approximation error $\frac{1}{1-\gamma}\epsilon_{app}(F)$ (Bertsekas & Tsitsiklis, 1996). The reference function class for a randomly-initialized single-hidden-layer feedforward neural network with frozen output layer is

$$F_{NTK} := \Big\{(s, a) \mapsto \mathbb{E}_{u_0 \sim N(0, I_d)}\big[v(u_0)^\top \nabla_u \varrho(\langle (s, a), u_0 \rangle)\big] \ \text{ such that } \ \mathbb{E}_{u_0 \sim N(0, I_d)}\big[\|v(u_0)\|_2^2\big] < \infty \Big\}, \qquad (10)$$

where $v : \mathbb{R}^d \to \mathbb{R}^d$ (Liu et al., 2019; Wang et al., 2019; Cayci et al., 2023).
Technically, the completion of $F_{NTK}$ yields the reproducing kernel Hilbert space (RKHS) of the so-called neural tangent kernel

$$\kappa(x, x') := \mathbb{E}_{u_0}\big[\nabla_u \varrho(u_0^\top x)^\top \nabla_u \varrho(u_0^\top x')\big] = x^\top x'\, \mathbb{E}\big[\varrho'(u_0^\top x)\, \varrho'(u_0^\top x')\big]$$

for any $x, x' \in S \times A$, and its explicit analysis shows that it is provably rich (Ji et al., 2019). For a detailed discussion of the function space $F_{NTK}$ and its role in reinforcement learning, we refer to Section A.2 in Liu et al. (2019) and Cayci et al. (2024b). Due to the concrete approximation bounds for $F_{NTK}$, the representational assumption $Q^\pi \in F_{NTK}$ is standard in the theoretical analyses of neural TD learning for MDPs, and the objective is to prove that neural TD learning can learn any $Q^\pi \in F_{NTK}$ from samples with finite-time and finite-sample guarantees (Cai et al., 2019; Wang et al., 2019; Cayci et al., 2023). Without the representational assumption $Q^\pi \in F_{NTK}$, the optimality guarantees in Cai et al. (2019); Liu et al. (2019); Wang et al. (2019); Cayci et al. (2023) hold up to an additional error term proportional to $\frac{1}{1-\gamma}\epsilon_{app}(F_{NTK})$.

Function approximation in RL for POMDPs. Analogous to the approximation error analysis in RL for MDPs discussed above, our objective here is to identify a suitable reference function class for the Ind RNN architecture defined in equation 7. Building on the framework of Cayci & Eryilmaz (2025), we present an infinite-width characterization of Ind RNNs in the neural tangent kernel (NTK) regime. This directly extends the reference function class $F_{NTK}$ in equation 10 for feedforward neural networks in the neural RL literature (Cai et al., 2019; Wang et al., 2019; Cayci et al., 2024b) to the partially observable setting with recurrent models. We note that our reference function class reduces to that of feedforward neural networks as a special case (see Remark 3.4). For any $t \in \mathbb{Z}_+$ and input $\bar{Z}$, symmetric initialization ensures that $F_t(\bar{Z}; \Theta^0) = 0$.
Furthermore, the first-order Taylor expansion of $F_t$ at $\Theta \in \mathbb{R}^{m(d+1)}$ around $\Theta^0$ yields

$$F_t(\bar{Z}; \Theta) = \nabla_\Theta F_t(\bar{Z}; \Theta^0)^\top (\Theta - \Theta^0) + O\big(\|\Theta - \Theta^0\|^2\big).$$

As $m \to \infty$, the linear part $\nabla_\Theta F_t(\bar{Z}; \Theta^0)^\top (\Theta - \Theta^0)$ is able to approximate a rich class of functions determined by the reproducing kernel Hilbert space (RKHS) of the recurrent neural tangent kernel, defined as

$$\kappa_t(\bar{Z}, \bar{Z}') := \lim_{m \to \infty} \nabla_\Theta F_t(\bar{Z}; \Theta^0)^\top \nabla_\Theta F_t(\bar{Z}'; \Theta^0), \quad \text{for } t \in \mathbb{Z}_+.$$

In the following, we characterize this sequence of reproducing kernel Hilbert spaces for $t \in \mathbb{Z}_+$ explicitly, following Cayci & Eryilmaz (2025). Let $w_0 \sim \mathrm{Rad}(\alpha)$ and $u_0 \sim N(0, I_d)$ be independent random variables, and $\theta_0 := (w_0, u_0)$. Given a sequence $\bar{z} = (x_0, x_1, \ldots) \in (Y \times A)^{\mathbb{Z}_+}$, let

$$h_t(\bar{z}_t; \theta_0) := \varrho\big(w_0 h_{t-1}(\bar{z}_{t-1}; \theta_0) + \langle u_0, x_t \rangle\big), \quad t = 0, 1, 2, \ldots,$$

with the initial condition $h_{-1} := 0$, and

$$I_t(\bar{z}_t; \theta_0) := \varrho'\big(w_0 h_{t-1}(\bar{z}_{t-1}; \theta_0) + \langle u_0, x_t \rangle\big).$$

Then, the neural tangent random feature mapping¹ at time $t$ is defined as

$$\psi_t(\bar{z}_t; \theta_0) := \sum_{k=0}^{t} w_0^k \begin{bmatrix} h_{t-k-1}(\bar{z}_{t-k-1}; \theta_0) \\ x_{t-k} \end{bmatrix} \prod_{j=0}^{k} I_{t-j}(\bar{z}_{t-j}; \theta_0).$$

Based on the sequence of neural tangent random features, the neural tangent random feature matrix is defined as $\Psi(\bar{z}; \theta_0) = \Psi_\infty(\bar{z}; \theta_0)$, where

$$\Psi_T(\bar{z}; \theta_0) := \begin{bmatrix} \psi_0^\top(\bar{z}_0; \theta_0) \\ \psi_1^\top(\bar{z}_1; \theta_0) \\ \vdots \\ \psi_{T-1}^\top(\bar{z}_{T-1}; \theta_0) \end{bmatrix}$$

for any $T \in \mathbb{Z}_+$.

Definition 3.2 (Transportation mapping). Let $H$ be the set of mappings $v : \mathbb{R}^{1+d} \to \mathbb{R}^{1+d}$ such that $v(\theta_0) := \begin{bmatrix} v_w(\theta_0) \\ v_u(\theta_0) \end{bmatrix}$ for $\theta_0 = (w_0, u_0)$ with $\mathbb{E}[\|v(\theta_0)\|_2^2] < \infty$, where $w_0 \sim \mathrm{Rad}(\alpha)$ and $u_0 \sim N(0, I_d)$. We call $v \in H$ a transportation mapping, following Ji & Telgarsky (2019); Ji et al. (2019).

Definition 3.3 (Reference function class for Ind RNNs). We define the reference function class of Ind RNNs for any sequence-length $T \geq 1$ as

$$F_T := \left\{ \bar{z} \mapsto \mathbb{E}\big[\Psi_T(\bar{z}; \theta_0)\, v(\theta_0)\big] = \begin{bmatrix} f_0(\bar{z}_0; v) \\ \vdots \\ f_{T-1}(\bar{z}_{T-1}; v) \end{bmatrix} : v \in H \right\}, \qquad \bar{z} \in (Y \times A)^{\mathbb{Z}_+},$$

where $f_t(\bar{z}_t; v) := \mathbb{E}[\psi_t^\top(\bar{z}_t; \theta_0)\, v(\theta_0)]$ for any $\bar{z} \in (Y \times A)^{\mathbb{Z}_+}$. The same transportation mapping $v$ is used to define $f_t$ for all $t \in \mathbb{N}$, which is a characteristic feature of weight-sharing in RNNs. We denote $F := F_\infty$.

Remark 3.4 (Reduction to $F_{NTK}$).
Note that setting $T = 1$ yields the random feature map

$$\psi_0(\bar{z}_0; \theta_0) = \begin{bmatrix} 0 \\ \nabla_u \varrho(\langle u_0, x_0 \rangle) \end{bmatrix},$$

since $\nabla_u \varrho(\langle x_0, u_0 \rangle) = x_0\, \varrho'(\langle x_0, u_0 \rangle)$. Hence, for any $v \in H$, we have

$$F_1 = \big\{x_0 \mapsto \mathbb{E}[v_u(u_0)^\top \nabla_u \varrho(\langle x_0, u_0 \rangle)] : \mathbb{E}\,\|v_u(u_0)\|_2^2 < \infty\big\},$$

which is exactly the reference function class $F_{NTK}$ for feedforward neural networks given in equation 10. In other words, $\{F_T : T \in \mathbb{Z}_+\}$ contains $F_{NTK}$ with $F_1 = F_{NTK}$, which is the reference function class in the neural RL literature for MDPs (Wang et al., 2019; Liu et al., 2019). $F_1$ is dense in the space of continuous functions on a compact set (Ji et al., 2019).

Remark 3.5 (Fully-connected RNNs). Ind RNNs utilize a diagonal hidden-to-hidden weight matrix $W$, which was shown to be very effective in handling long-term dependencies in RL compared to conventional RNNs, GRU and LSTM architectures (Morad et al., 2023). In addition to their practical benefits, Ind RNNs have theoretical niceties as well, as they enable (i) explicit characterization of the reference function class, and (ii) direct control and analysis of the spectral radius of $W$. Both of these theoretical amenities are lost when $W$ does not have a diagonal structure.

3.2 Max-Norm Projection for Ind RNNs

Given an initialization $(W(0), U(0), c)$ as in Definition 3.1 and a vector $\rho = (\rho_w, \rho_u) \in \mathbb{R}^2_{>0}$ of projection radii, we define the compactly-supported set of weights $\Omega_{\rho,m} \subset \mathbb{R}^{m(d+1)}$ as

$$\Omega_{\rho,m} = \Big\{\Theta \in \mathbb{R}^{m(d+1)} : \max_i |W_{ii} - W_{ii}(0)| \leq \frac{\rho_w}{\sqrt{m}}, \ \max_i \|U_i - U_i(0)\|_2 \leq \frac{\rho_u}{\sqrt{m}}\Big\}.$$

¹The feature uses a weighted sum of all past inputs $x_k$, $k \leq t$, leading to a discounted memory to tackle non-stationarity. $x_{t-k}$ is scaled with $w_0^k$, $w_0 \sim \mathrm{Rad}(\alpha)$, thus it yields a fading-memory approximation of the history if $\alpha < 1$.

Given any symmetric random initialization $(W(0), U(0), c)$ and $\rho \in \mathbb{R}^2_{>0}$, the set $\Omega_{\rho,m}$ is a compact and convex subset of $\mathbb{R}^{m(d+1)}$, and any $\Theta \in \Omega_{\rho,m}$ satisfies $\max_{1 \leq i \leq m} |W_{ii} - W_{ii}(0)| \leq \frac{\rho_w}{\sqrt{m}}$ and $\max_{1 \leq i \leq m} \|U_i - U_i(0)\|_2 \leq \frac{\rho_u}{\sqrt{m}}$.
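The row-wise max-norm projection onto $\Omega_{\rho,m}$ amounts to a scalar clip for each $W_{ii}$ and a Euclidean-ball projection for each row $U_i$, both centered at the initialization; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def maxnorm_project(W, U, W0, U0, rho_w, rho_u):
    """Max-norm projection onto Omega_{rho,m}: each W_ii is clipped to within
    rho_w/sqrt(m) of its initialization; each row U_i is projected onto an
    L2 ball of radius rho_u/sqrt(m) around its initialization."""
    m = W.shape[0]
    rw, ru = rho_w / np.sqrt(m), rho_u / np.sqrt(m)
    W_proj = np.clip(W, W0 - rw, W0 + rw)                  # scalar clip per unit
    diff = U - U0
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    scale = np.minimum(1.0, ru / np.maximum(norms, 1e-12)) # shrink rows outside ball
    return W_proj, U0 + diff * scale

rng = np.random.default_rng(4)
m, d = 16, 3
W0, U0 = rng.choice([-0.5, 0.5], size=m), rng.normal(size=(m, d))
W, U = W0 + rng.normal(size=m), U0 + rng.normal(size=(m, d))
Wp, Up = maxnorm_project(W, U, W0, U0, rho_w=1.0, rho_u=1.0)
```

Because the constraint set is a product of per-unit balls, the projection decomposes unit by unit, which is what makes it cheap even for large $m$.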
The projection of $\Theta$ onto $\Omega_{\rho,m}$ operates row-wise:

$$\mathrm{Proj}_{\Omega_{\rho,m}}[\Theta] = \left(\arg\min_{w_i \in B_2(W_{ii}(0),\, \rho_w/\sqrt{m})} |W_{ii} - w_i|,\ \ \arg\min_{u_i \in B_2(U_i(0),\, \rho_u/\sqrt{m})} \|U_i - u_i\|_2\right)_{i \in [m]}.$$

As such, the projection operator $\mathrm{Proj}_{\Omega_{\rho,m}}[\cdot]$ onto $\Omega_{\rho,m}$ is called the max-norm projection (or regularization) (Goodfellow et al., 2013; Srebro et al., 2004). As an immediate consequence, $\Theta \in \Omega_{\rho,m}$ implies that

$$|W_{ii}| \leq |W_{ii} - W_{ii}(0)| + |W_{ii}(0)| \leq \alpha + \frac{\rho_w}{\sqrt{m}} =: \alpha_m,$$

which implies strict control over $\max_{i \in [m]} |W_{ii}|$. As we will see in Section 5 and Section 6, such strict control over the norm of the hidden-to-hidden weights $W_{ii}$ is of significant importance in stabilizing the training of Ind RNNs. Similar projection mechanisms for Ind RNNs are adopted in practice as well (Morad et al., 2023). For further details, we refer to Appendix A.

4 Rec-NAC: A High-Level Algorithmic View

In this section, we present a high-level description of our Recurrent Natural Actor-Critic (Rec-NAC) algorithm with two inner loops, the critic (called Rec-TD) and the actor (called Rec-NPG), for policy optimization with RNNs. The details of the inner loops will be given in the succeeding sections. We use an admissible policy $\pi = (\pi_t)_{t \in \mathbb{N}}$ that is parameterized by a recurrent neural network $(F_t(\cdot; \Phi))_{t \in \mathbb{N}}$ of the form given in equation 7 with network width $m \in \mathbb{Z}_+$. To that end, for any $t \in \mathbb{N}$, let

$$\pi^\Phi_t(a \mid z_t) := \frac{\exp(F_t((z_t, a); \Phi))}{\sum_{a' \in A} \exp(F_t((z_t, a'); \Phi))}, \qquad (15)$$

for any $z_t \in (Y \times A)^t \times Y$ and $a \in A$, with parameter $\Phi \in \mathbb{R}^{m(d+1)}$. The high-level operation of Rec-NAC is summarized in Algorithm 1.

Algorithm 1 Recurrent Natural Actor-Critic (Rec-NAC): a high-level description
1: Initialize the actor RNN as $(c, \Phi(0)) \sim \zeta_0$ (see Definition 3.1).
2: for $n = 0, 1, 2, \ldots, N-1$ do
3:   Critic. Independently initialize the weights of the critic Ind RNN as $(c_n, \Theta_n(0)) \overset{iid}{\sim} \zeta_0$.
4:   Run Rec-TD in Algorithm 2 for $K_{td}$ iterations, and obtain $\bar{\Theta}_n := K_{td}^{-1} \sum_{k < K_{td}} \Theta_n(k)$.

Algorithm 2 Rec-TD for policy evaluation
1: Input: step-size $\eta > 0$, max-norm projection radius $\rho = (\rho_w, \rho_u)$, sequence-length $T$.
2: Initialize $(c, \Theta(0)) \sim \zeta_0$ according to Definition 3.1.
3: for $k = 0, 1, 2, \ldots, K-1$ do
4:   Sample an initial state $S^k_0 \sim \mu$ independently.
5:   Observe $Y^k_0 \sim \phi(S^k_0, \cdot)$.
6:   Choose an action $A^k_0 \sim \pi_0(\cdot \mid Z^k_0)$.
7:   Set $\check{R}^k_T := 0$.
8:   for $t = 0, 1, \ldots, T$ do
9:     State transition $S^k_{t+1} \sim P((S^k_t, A^k_t), \cdot)$.
10:    Observe $Y^k_{t+1} \sim \phi(S^k_{t+1}, \cdot)$.
11:    Choose an action $A^k_{t+1} \sim \pi_{t+1}(\cdot \mid Z^k_{t+1})$.
12:    Compute the temporal difference $\delta_t(\bar{Z}^k_{t+1}; \Theta(k))$, where $\delta_t(\bar{z}_{t+1}; \Theta) := r_t + \gamma F_{t+1}(\bar{z}_{t+1}; \Theta) - F_t(\bar{z}_t; \Theta)$.
13:    Update the stochastic semi-gradient: $\check{R}^k_T \leftarrow \check{R}^k_T + \gamma^t\, \delta_t(\bar{Z}^k_{t+1}; \Theta(k))\, \nabla_\Theta F_t(\bar{Z}^k_t; \Theta(k))$.
14:  end for
15:  Parameter update with max-norm projection: $\Theta(k+1) = \mathrm{Proj}_{\Omega_{\rho,m}}\big[\Theta(k) + \eta\, \check{R}^k_T\big]$.
16: end for

Assumption 5.3 is a representational assumption, stating that $(Q^\pi_t)_t$ lies in the RKHS induced by the random features $\Psi_T(\bar{z}; \theta_0)$ defined in equation 12. It directly extends Assumption 4.1 in Wang et al. (2019) and Assumption 2 in Cayci et al. (2024b) to POMDPs, and exactly recovers these assumptions when $T = 1$ (see Remark 3.4).

Theorem 5.4 (Finite-time bounds for Rec-TD). Under Assumptions 5.1-5.3, for any projection radius $\rho \geq \nu = (\nu_w, \nu_u)$ and step-size $\eta > 0$, Rec-TD with max-norm projection achieves the following error bound:

$$\frac{1}{K}\,\mathbb{E}\Big[\sum_{k=0}^{K-1} R^\pi_T(\Theta(k))\Big] \lesssim \frac{\|\nu\|_2^2}{K\eta(1-\gamma)} + \frac{\eta\, C^{(1)}_T}{(1-\gamma)^3} + \frac{C^{(2)}_T}{(1-\gamma)^2 \sqrt{m}} + \frac{\gamma^T}{K}\sum_{k=0}^{K-1} \omega^2_{T,k} \qquad (17)$$

for any $K \in \mathbb{N}$, where

$$C^{(1)}_T,\, C^{(2)}_T = \mathrm{poly}\big(p_T((\alpha + \rho_w m^{-1/2})\varrho_1),\, \|\rho\|_2,\, \|\nu\|_2\big)$$

are instance-dependent constants that do not depend on $K$, and $\omega_{t,k} := \sqrt{\mathbb{E}\big[(F_t(\bar{Z}_t; \Theta(k)) - Q^\pi_t(\bar{Z}^k_t))^2\big]}$ is a uniformly bounded sequence for $t, k \in \mathbb{N}$. Furthermore, the loss at the average-iterate, $\mathbb{E}\big[R^\pi_T\big(\frac{1}{K}\sum_{k=0}^{K-1}\Theta(k)\big)\big]$, admits the same upper bound as the regret bound in equation 17, up to a multiplicative factor of 10.

The proof of Theorem 5.4 can be found in Section B. Assumption 5.1 is critical to obtain the finite-time bounds in Theorem 5.4, and holds when the system can be restarted independently from the initial state distribution (Bhandari et al., 2018).
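The inner loop of Algorithm 2 can be sketched as follows. To keep the example self-contained, we use a linear critic $F_t(\bar z_t; \theta) = \langle \varphi_t, \theta\rangle$ in place of the Ind RNN and a single Euclidean ball in place of $\Omega_{\rho,m}$, so this illustrates only the semi-gradient accumulation and projected update, not the full method; all names and data are ours.

```python
import numpy as np

def rec_td_episode(features, rewards, theta, gamma, eta, radius):
    """One episode of the Rec-TD inner loop (Algorithm 2, steps 7-15) with a
    linear critic F_t = <features[t], theta> standing in for the Ind RNN.
    Accumulates the discounted semi-gradient, then takes a projected step."""
    R = np.zeros_like(theta)
    for t in range(len(rewards)):
        F_t = features[t] @ theta
        F_next = features[t + 1] @ theta if t + 1 < len(features) else 0.0
        delta = rewards[t] + gamma * F_next - F_t      # TD error delta_t
        R += gamma ** t * delta * features[t]          # gamma^t delta_t grad F_t
    theta_new = theta + eta * R                        # ascent step
    nrm = np.linalg.norm(theta_new)
    if nrm > radius:                                   # crude projection stand-in
        theta_new *= radius / nrm
    return theta_new

rng = np.random.default_rng(5)
phi = rng.normal(size=(6, 4))                          # features for a 6-step path
theta = np.zeros(4)
for _ in range(200):                                   # K = 200 restarted episodes
    theta = rec_td_episode(phi, np.ones(5), theta, gamma=0.9, eta=0.05, radius=10.0)
```

The projection keeps the iterates bounded regardless of the TD error magnitudes, mirroring the stabilizing role of the max-norm projection in the analysis.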
In the specific case of fully-observable MDPs, the process $\{(S_k, A_k) : k \in \mathbb{N}\}$ is a Markov chain under any stationary policy, and mixing-time arguments under uniform ergodicity assumptions are used for analysis under Markovian sampling from a single trajectory without independent restarts (Bhandari et al., 2018; Cayci et al., 2023). On the other hand, in the case of POMDPs, $\{(S_k, A_k) : k \in \mathbb{N}\}$ is not a Markov chain under a general non-stationary policy $\pi$. In the specific case of policies parameterized by RNNs with hidden state $\{H_k : k \in \mathbb{N}\}$, the augmented process $\{(S_k, A_k, Y_k, H_k) : k \in \mathbb{N}\}$ forms a Markov process. The challenge here is that the state space of this augmented Markov process may be very large or even continuous, and standard theoretical tools (e.g., mixing-time arguments) become much more involved.

Under Assumption 5.3, Theorem 5.4 implies the global $\epsilon$-optimality of Rec-TD as the sequence-length $T \to \infty$, for a sufficiently large number of iterations $K = O(C^{(1)}_T/\epsilon^2)$ and network width $m = O(C^{(2)}_T/\epsilon^2)$. If we omit Assumption 5.3, the error bound in Theorem 5.4 still holds with an additional error term $O\big(\frac{1}{1-\gamma}\epsilon_{app}(F_T)\big)$, where

$$\epsilon_{app}(F_T) := \inf_{f \in F_T} \mathbb{E}^\pi_\mu\Big[\sum_{t=0}^{T-1} \gamma^t \big(f_t(\bar{Z}_t) - Q^\pi_t(\bar{Z}_t)\big)^2\Big]$$

is the function approximation error.

Remark 5.5 (Overcoming perceptual aliasing with Rec-TD). Memoryless TD learning suffers from a non-vanishing optimality gap in POMDPs, known as perceptual aliasing (Singh et al., 1994). To address this, Rec-TD integrates $T$-step stochastic approximation with an RNN, enabling it to retain memory. Accordingly, Theorem 5.4 establishes that as $T \to \infty$, Rec-TD reduces $R^\pi_T$ to arbitrarily small values, given sufficiently large network width $m$ and iteration count $K$.

Remark 5.6 (The impact of long-term dependencies). Note that both constants $C^{(1)}_T, C^{(2)}_T$ depend polynomially on $p_T(\varrho_1 \alpha_m)$. As noted in Goodfellow et al.
(2016), the spectral radius of $\{W(k) : k \in \mathbb{N}\}$ determines the degree of long-term dependencies in the problem, as it scales $H_t$. Consistent with this observation, our bounds depend on

$$\alpha_m := \alpha + \frac{\rho_w}{\sqrt{m}} \geq \sqrt{\lambda_{\max}(W^\top(k) W(k))} = \max_{i \in [m]} |W_{ii}(k)|,$$

for any $k \in \mathbb{N}$. Note that Theorem 5.4 requires $\rho_w \geq \nu_w$, thus $\max_{i \in [m]} |W_{ii}(k)|$ may need to be sufficiently large depending on the RKHS norm $\nu$. Let $\varepsilon > 0$ be any given target error.

Short-term memory. If $\alpha_m < \frac{1}{\varrho_1}$, then it is easy to see that $p_T(\varrho_1 \alpha_m) \leq \frac{1}{1 - \varrho_1 \alpha_m}$. Thus, the extra term involving $\gamma^T$ in equation 17 vanishes at a geometric rate as $T \to \infty$, while $m$ (network width) and $K$ (iteration complexity) remain $O(1/\varepsilon^2)$. Rec-TD is very efficient in this case.

Long-term memory. If $\alpha_m > \frac{1}{\varrho_1}$, then as $T \to \infty$, both $m$ and $K$ grow at a rate $O\big((\varrho_1 \alpha_m)^T / \varepsilon^2\big)$, while the extra term involving $\gamma^T$ in equation 17 vanishes at a geometric rate. As such, the required network size and iteration count grow geometrically in $T$ in systems with long-term memory, constituting the pathological case.

Theorem 5.4 emphasizes the critical importance of max-norm projection and a large neural network size $m$ in stabilizing the training of Ind RNNs by Rec-TD, and guides the choice of the projection radius $\rho$. Interestingly, if $\{Q^\pi_t : t < T\} \in F_T$ has an RKHS norm $\nu_w \leq 1/\varrho_1$, then Rec-TD with a projection radius $\rho_w \geq \nu_w$ and overparameterization $m \gg 1$ yields significantly improved policy evaluation performance in terms of $C^{(1)}_T, C^{(2)}_T$ for large $T$. Similar projection mechanisms on $\{W_{ii} : i \in [m]\}$ are widely used for Ind RNNs in practice, for instance in Morad et al. (2023), to enhance stability. The performance of Rec-TD is studied numerically on Random-POMDP instances in Section C.
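The two regimes of Remark 5.6 can be illustrated numerically. Here we read $p_T(x)$ as the geometric sum $\sum_{t < T} x^t$, which is our interpretation of how the constant enters the bound, not a definition taken from the text:

```python
# Geometric sum p_T(x) = sum_{t<T} x^t: bounded for x < 1 (short-term memory),
# geometrically growing for x > 1 (long-term memory, the pathological case).
def p(T, x):
    return sum(x ** t for t in range(T))

short = [p(T, 0.8) for T in (5, 20, 80)]   # rho1 * alpha_m < 1: saturates at 1/(1-x) = 5
long_ = [p(T, 1.2) for T in (5, 20, 80)]   # rho1 * alpha_m > 1: blows up with T
```

This is the exploding-semi-gradients dichotomy in miniature: the same formula is benign or catastrophic depending on whether $\varrho_1 \alpha_m$ sits below or above 1.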
6 Actor: Recurrent Natural Policy Gradient (Rec-NPG) for POMDPs

The goal is to solve the following problem for a given initial distribution $\mu \in \Delta(Y)$ and $\rho \in \mathbb{R}^2_{>0}$:

$$\max_{\Phi \in \mathbb{R}^{m(d+1)}} V^{\pi_\Phi}(\mu) \quad \text{such that} \quad \Phi \in \Omega_{\rho,m}. \qquad \text{(PO)}$$

6.1 Recurrent Natural Policy Gradient for POMDPs

In this section, we describe the recurrent natural policy gradient (Rec-NPG) algorithm for non-stationary reinforcement learning. First, we formally establish in Prop. D.2 that the policy gradient under partial observability takes the form

$$\nabla_\Phi V^{\pi_\Phi}(\mu) = \mathbb{E}^{\pi_\Phi}_\mu\Big[\sum_{t=0}^{\infty} \gamma^t\, Q^{\pi_\Phi}_t(Z_t, A_t)\, \nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)\Big],$$

where the state $S_t$ in the MDP framework is replaced by the process history $Z_t$ in the POMDP. The Fisher information matrix under a policy $\pi_\Phi$ is defined as

$$G_\mu(\Phi) := \mathbb{E}^{\pi_\Phi}_\mu\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)\, \nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)^\top\Big]$$

for an initial observation distribution $\mu \in \Delta(Y)$. Rec-NPG updates the policy parameters by

$$\Phi(n+1) = \Phi(n) + \eta\, G^\dagger_\mu(\Phi(n))\, \nabla_\Phi V^{\pi_{\Phi(n)}}(\mu), \qquad (18)$$

for an initial parameter $\Phi(0)$ and step-size $\eta > 0$, where $G^\dagger$ denotes the Moore-Penrose inverse of a matrix $G$. This update rule is in the same spirit as the NPG introduced in Kakade (2001); however, due to the non-stationary nature of the partially observable MDP, it has significant complications that we will address. In order to avoid computationally expensive policy updates in equation 18, we utilize the following extension of the compatible function approximation in Kakade (2001) to the case of non-stationary policies for POMDPs.

Proposition 6.1 (Compatible function approximation for non-stationary policies). For any $\Phi \in \mathbb{R}^{m(d+1)}$ and initial observation distribution $\mu$, let

$$L_\mu(\omega; \Phi) = \mathbb{E}^{\pi_\Phi}_\mu\Big[\sum_{t=0}^{\infty} \gamma^t \big(\nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)^\top \omega - A^{\pi_\Phi}_t(\bar{Z}_t)\big)^2\Big] \qquad (19)$$

for $\omega \in \mathbb{R}^{m(d+1)}$. Then, we have

$$G^\dagger_\mu(\Phi)\, \nabla_\Phi V^{\pi_\Phi}(\mu) \in \arg\min_{\omega \in \mathbb{R}^{m(d+1)}} L_\mu(\omega; \Phi). \qquad (20)$$

We have the following remark regarding the intricacies of compatible function approximation in the POMDP setting.

Remark 6.2 (Path-based compatible function approximation with truncation).
For MDPs, the compatible function approximation error $L_\mu(\omega; \Phi)$ can be expressed using the discounted state-action occupancy measure, from which one can obtain unbiased samples (Agarwal et al., 2020; Konda & Tsitsiklis, 2003). Thus, the infinite horizon can be handled without any loss. On the other hand, for POMDPs, as in equation 19, this simplification is impossible due to the non-stationarity. As such, we use a path-based method for a sequence-length $T \in \mathbb{N}$ with

$$\ell_T(\omega; \Phi, Q) := \sum_{t=0}^{T-1} \gamma^t \big(\nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)^\top \omega - A_t(Z_t, A_t)\big)^2,$$

where $A_t(z_t, a_t) = Q_t(z_t, a_t) - \sum_{a \in A} \pi^\Phi_t(a \mid z_t)\, Q_t(z_t, a)$ is the advantage function. Given a policy with parameter $\Phi(n)$, the corresponding output of the critic, obtained by Rec-TD with the average-iterate, is $\hat{Q}^{(n)}_t(\cdot) := F_t(\cdot; \bar{\Theta}_n)$ for $\bar{\Theta}_n := \frac{1}{K_{td}} \sum_{k < K_{td}} \Theta_n(k)$.

If $\alpha_m > \varrho_1^{-1}$, then $m$ and $N$ should grow at a rate $(\alpha_m \varrho_1)^T$, implying the curse of dimensionality (more generally, this is known as the exploding gradient problem, Goodfellow et al. (2016)). On the other hand, if $\alpha_m < \varrho_1^{-1}$, then $L_t, \beta_t, \Lambda_t, \chi_t$ are all $O(1)$ for all $t$, implying efficient learning of POMDPs. This establishes a very interesting connection between the memory in the system, the continuity and smoothness of the RNN with respect to its parameters, and the optimality gap under Rec-NPG.

The term $\frac{2\gamma^T \bar{r}}{(1-\gamma)^2}$ is due to truncating the trajectory at $T$, and vanishes with large $T$. Rec-NPG achieves $\epsilon$-optimality (up to the compatible function approximation and truncation errors) with $N = O(1/\epsilon^2)$ steps and $m = O(1/\epsilon^4)$ neural network width for any $\epsilon > 0$.

Remark 6.5. The quantity $\kappa$ in Proposition 6.8 is the so-called concentrability coefficient in policy gradient methods (Agarwal et al., 2020; Bhandari & Russo, 2019; Wang et al., 2019), and determines the complexity of exploration. Note that it is defined in terms of path probabilities $P^{\pi,\mu}_T$ in the non-stationary setting.
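Minimizing the truncated loss $\ell_T$ over $\omega$ is a $\gamma^t$-weighted least-squares problem in the score features $\nabla_\Phi \ln \pi^\Phi_t$; a minimal sketch with synthetic scores and realizable advantages (all data here is fabricated for illustration, and `lstsq` stands in for the sampled minimization):

```python
import numpy as np

def npg_direction(scores, advantages, weights):
    """Natural-gradient direction via compatible function approximation:
    minimize sum_t weights[t] * (<scores[t], w> - advantages[t])^2 by
    weighted least squares, avoiding explicit Fisher matrix inversion."""
    sw = np.sqrt(weights)[:, None]
    w, *_ = np.linalg.lstsq(scores * sw,
                            advantages * np.sqrt(weights), rcond=None)
    return w

rng = np.random.default_rng(6)
T, p = 50, 4
scores = rng.normal(size=(T, p))        # stand-ins for grad log pi_t
w_true = np.array([1.0, -2.0, 0.5, 0.0])
A = scores @ w_true                     # realizable advantages for the check
gammas = 0.99 ** np.arange(T)           # gamma^t weights
w_hat = npg_direction(scores, A, gammas)
```

When the advantages are exactly realizable by the score features, the least-squares solution recovers the natural-gradient direction exactly, which mirrors the exact case of Proposition 6.1.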
By assuming $\kappa < \infty$, we assume that the policies $\pi_{\Phi(n)}$ perform sufficient exploration to visit each trajectory visited by $\pi^\star$ with positive probability. To establish similar bounds without this assumption, entropic regularization is widely used to encourage exploration in practical scenarios (Ahmed et al., 2019; Cen et al., 2020; Cayci et al., 2024c). The benefits of entropic regularization in policy optimization for POMDPs to encourage exploration are an interesting future research direction.

In the following, we decompose the compatible function approximation error $\varepsilon^T_{cfa}$ into the approximation error for the RNN and the statistical errors. To that end, let $\varepsilon_{app,n} = \inf_{\omega \in \Omega_{\rho,m}} \mathbb{E}\big[\sum_{t \geq 0} \cdots\big]$ denote the approximation error for the RNN. For some index set $J$ and $\nu \in \mathbb{R}^2_{>0}$, we consider a class $H_{J,\nu}$ of transportation mappings

$$\Big\{v^{(j)} \in H : j \in J\Big\}, \qquad \sup_{\theta \in \mathbb{R}^{d+1},\, j \in J} |v^{(j)}_w(\theta)| \leq \nu_w, \qquad \sup_{\theta \in \mathbb{R}^{d+1},\, j \in J} \|v^{(j)}_u(\theta)\|_2 \leq \nu_u,$$

and also the corresponding infinite-width limit $F_{J,\nu} := \{\bar{z} \mapsto \mathbb{E}[\Psi(\bar{z}; \theta_0)\, v(\theta_0)] : v \in \mathrm{Conv}(H_{J,\nu})\}$, where $\Psi(\cdot; \theta_0)$ is the NTRF matrix defined in equation 12. We assume that there exist an index set $J$ and $\nu \in \mathbb{R}^2_{>0}$ such that $Q^{\pi_{\Phi(n)}} \in F_{J,\nu}$ for all $n \in \mathbb{N}$. This representational assumption implies that the Q-functions under all iterate policies $\pi_{\Phi(n)}$ throughout the Rec-NPG iterations $n = 0, 1, \ldots$ can be represented by convex combinations of a fixed set of mappings in the NTK function class $F$ indexed by $J$. As we will see, the richness of $J$, as measured by a relevant Rademacher complexity, will play an important role in bounding the approximation error. To that end, for $\bar{z}_t = (z_t, a_t) \in (Y \times A)^{t+1}$, let

$$G^{\bar{z}_t}_t := \big\{\phi \mapsto \nabla_\phi H^{(1)}_t(\bar{z}_t; \phi)^\top v(\phi) : v \in H_{J,\nu}\big\}, \qquad \mathrm{Rad}_m(G^{\bar{z}_t}_t) := \mathbb{E}_{\substack{\epsilon \sim \mathrm{Rad}^m(1) \\ \Phi(0) \sim \zeta_{init}}}\Big[\sup_{g \in G^{\bar{z}_t}_t} \frac{1}{m} \sum_{i=1}^{m} \epsilon_i\, g(\Phi_i(0))\Big].$$

Note that $v \in H_{J,\nu}$ above can be replaced with $v \in \mathrm{Conv}(H_{J,\nu})$ without any loss. In that case, since the mapping $v^{(j)} \mapsto f_t(\bar{z}_t; v^{(j)}) \in G^{\bar{z}_t}_t$ is linear, $G^{\bar{z}_t}_t$ is replaced with $\mathrm{Conv}(G^{\bar{z}_t}_t)$ without changing the Rademacher complexity (Mohri et al., 2018).
The following provides a finer characterization of the approximation error.

Proposition 6.8. Under Assumption 6.7, if $\rho \geq \nu$, then

$$\epsilon_{app,n} \leq \frac{1}{1-\gamma}\, 2 \max_{0 \leq t} \cdots$$