# Recurrent Natural Policy Gradient for POMDPs

Published in Transactions on Machine Learning Research (10/2025)

Semih Cayci (cayci@mathc.rwth-aachen.de), Department of Mathematics, RWTH Aachen University

Atilla Eryilmaz (eryilmaz.2@osu.edu), Department of Electrical and Computer Engineering, The Ohio State University

Reviewed on OpenReview: https://openreview.net/forum?id=6G01e0vgIf

Abstract

Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs, while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on the sample and iteration complexity required to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining the limitations of RNN-based policy optimization for POMDPs.

1 Introduction

Reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) has been a particularly challenging problem due to the absence of an optimal stationary policy, which leads to a curse of dimensionality as the space of non-stationary policies grows exponentially over time (Krishnamurthy, 2016; Murphy, 2000).
To address this curse of dimensionality in solving POMDPs, finite-memory (Yu & Bertsekas, 2008; Yu, 2012; Kara & Yüksel, 2023; Cayci et al., 2024a) and RNN-based (Lin & Mitchell, 1993; Whitehead & Lin, 1995; Wierstra et al., 2010; Mnih et al., 2014; Ni et al., 2021; Lu et al., 2024) model-free RL approaches are widely used. Despite the empirical success of RNN-based model-free RL methods, a rigorous theoretical understanding of their performance in the POMDP setting remains limited. We begin by outlining two key observations that motivate our approach.

Observation 1. Recurrent neural networks (RNNs) have been extensively employed in model-free reinforcement learning (RL) to solve partially observable Markov decision processes (POMDPs) (Whitehead & Lin, 1995; Wierstra et al., 2010; Mnih et al., 2014). Recent work by Ni et al. (2021) demonstrates that RNN-based model-free RL can perform competitively with more sophisticated and structured approaches under appropriate hyperparameter and architecture choices. Lu et al. (2024) demonstrated shortcomings of emerging transformers in solving POMDPs, and showed, somewhat surprisingly, that particular recurrent architectures can achieve superior practical performance in certain scenarios. However, despite this plethora of works demonstrating the effectiveness of RNN-based model-free algorithms for solving POMDPs, a concrete theoretical understanding of these methods is still in a nascent stage. This is particularly important since, as noted by Ni et al. (2021), RNN-based model-free RL algorithms are sensitive to optimization parameters, and identifying provably good choices is important for practice.

Observation 2.
The natural policy gradient (NPG) framework has been shown to be effective in solving MDPs due to its versatility in encompassing powerful function approximators, such as deep neural networks (Wang et al., 2019; Cayci et al., 2024b). However, a naïve application of such non-recurrent model-free RL algorithms to POMDPs has been observed to be ineffective (Ni et al., 2021), which necessitates the careful incorporation of recurrent architectures into the policy optimization framework. This calls for incorporating and analyzing policy optimization, particularly the NPG framework, augmented with recurrent architectures, to obtain a provably effective solution for POMDPs. Our study is motivated by these observations and guided by the following key questions, each addressed in this work:

Q1. How can we achieve (i) provably effective and (ii) computation/memory-efficient policy evaluation for non-stationary policies in partially observable environments?

A temporal difference (TD) learning algorithm with an Ind RNN (Rec-TD) overcomes the so-called perceptual aliasing problem inherent in memoryless TD learning for POMDPs (Singh et al., 1994), and achieves near-optimal policy evaluation, provided a sufficiently large network (Theorem 5.4 and Remark 5.5). Our analysis identifies the exploding semi-gradients pathology in policy evaluation, which can significantly increase the network and iteration complexities needed to mitigate perceptual aliasing under long-term dependencies (Remark 5.6), and demonstrates the role of regularization in mitigating this. We also provide empirical results on random-POMDP instances in Appendix C.

Q2. How can we parameterize non-stationary policies by a rich and practically feasible class of RNNs and perform efficient policy optimization?
We represent non-stationary policies using Ind RNNs with softmax parameterization as a form of finite-state controller, and perform computationally efficient NPG updates (based on path-based compatible function approximation for POMDPs) for policy optimization. The policy optimization update (called Rec-NPG) is aided by Rec-TD as the critic (Section 4).

Q3. What are the memory, computation and sample complexities of the resulting Rec-NAC method, which employs Rec-NPG for policy updates and Rec-TD for policy evaluation?

Our non-asymptotic analyses of Rec-TD (Theorem 5.4) and Rec-NPG (Theorem 6.3) demonstrate their near-optimality in the large-network limit while highlighting dependencies on memory, long-term POMDP dynamics, and RNN smoothness. Pathological cases with long-term dependencies may require exponentially growing resources (Remarks 5.6-6.4). These results establish principled and scalable RL solutions for POMDPs, offering insights into the interplay between memory, smoothness, and optimization complexity.

1.1 Previous work

The natural policy gradient method, proposed by Kakade (2001), has been extensively investigated for MDPs (Agarwal et al., 2020; Cen et al., 2020; Khodadadian et al., 2021; Liu et al., 2020; Cayci et al., 2024c), and analyses of NPG with feedforward neural networks (FNNs) have been established by Wang et al. (2019); Liu et al. (2019); Cayci et al. (2024b). As these works consider MDPs, the policies are stationary. In our case, the joint analysis of RNNs and POMDPs constitutes a significant additional challenge. Standard TD learning, which does not have a memory structure, was shown to be suboptimal for POMDPs (Singh et al., 1994). In this work, we incorporate RNNs into TD learning as a form of memory to address this problem. In Yu (2012); Singh et al. (1994); Uehara et al. (2022); Kara & Yüksel (2023); Cayci et al. (2024a), finite-memory policies based on sliding-window approximations of the history were investigated.
Bilinear frameworks with memory-based policies (Uehara et al., 2022) and Hilbert space embeddings with deterministic latent dynamics (Uehara et al., 2023) enable sample-efficient learning under specific model structures. In Guo et al. (2022), an offline RL algorithm for the specific class of linear POMDPs was proposed. Unlike these existing works, our approach integrates RNNs with NAC methods, providing a scalable and theoretically grounded framework for general POMDPs without requiring structural assumptions such as deterministic transitions, fixed memory windows, or linear POMDP dynamics. Value- and policy-based model-free RL algorithms based on RNNs have been widely considered in practice to solve POMDPs (Lin & Mitchell, 1993; Whitehead & Lin, 1995; Wierstra et al., 2010; Mnih et al., 2014; Ni et al., 2021; Lu et al., 2024). However, these works are predominantly experimental; to the best of our knowledge, there is no theoretical analysis of RNN-based RL methods for POMDPs. In this work, we also present theoretical guarantees for RNN-based NPG for POMDPs. For structural results on the hardness of RL for POMDPs, we refer to (Liu et al., 2022; Singh et al., 1994).

1.2 Notation

For a finite set $A$, $\Delta(A) = \{v \in \mathbb{R}^{|A|}_{\geq 0} : \sum_{a \in A} v_a = 1\}$ is the set of probability vectors over the set $A$. $\mathrm{Rad}(\alpha) = \mathrm{Unif}\{-\alpha, \alpha\}$ for $\alpha \in \mathbb{R}_+$.

2 Preliminaries on Partially Observable Markov Decision Processes

In this paper, we consider a discrete-time infinite-horizon partially observable Markov decision process (POMDP) with the (nonlinear) dynamics

$$\mathbb{P}(S_{t+1} = s \mid S_k, A_k, k \leq t) =: P((S_t, A_t), s), \qquad \mathbb{P}(Y_t = y \mid S_t) =: \phi(S_t, y),$$

for any $s \in S$ and $y \in Y$, where $S_t$ is an $S$-valued state, $Y_t$ is a $Y$-valued observation, and $A_t$ is an $A$-valued control process, with stochastic kernels $P : S \times A \times S \to [0, 1]$ and $\phi : S \times Y \to [0, 1]$.
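To make the dynamics concrete, the following is a minimal sketch of sampling a trajectory from a tabular POMDP with the kernels $P$ and $\phi$ above; all names and the toy kernels are hypothetical, and for simplicity we draw the initial state directly rather than through an initial observation distribution.

```python
import numpy as np

def rollout(P, phi, policy, mu, T, rng):
    """Sample one length-T trajectory from a tabular POMDP.

    P:   (S, A, S') array, P[s, a, s'] = state transition kernel
    phi: (S, Y) array, phi[s, y]      = observation kernel
    policy(history) -> distribution over actions given (y0, a0, ..., yt)
    mu:  (S,) initial state distribution (a simplification for this sketch)
    """
    S = rng.choice(len(mu), p=mu)
    history, states = [], []
    for t in range(T):
        Y = rng.choice(phi.shape[1], p=phi[S])          # observe Y_t ~ phi(S_t, .)
        history.append(("obs", Y))
        A = rng.choice(P.shape[1], p=policy(history))   # act on the history Z_t
        history.append(("act", A))
        states.append((S, A))
        S = rng.choice(P.shape[2], p=P[S, A])           # S_{t+1} ~ P((S_t, A_t), .)
    return history, states

# toy 2-state / 2-observation / 2-action POMDP with a memoryless uniform policy
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
phi = np.array([[0.8, 0.2], [0.3, 0.7]])
uniform = lambda h: np.array([0.5, 0.5])
hist, sa = rollout(P, phi, uniform, np.array([1.0, 0.0]), T=5, rng=rng)
```

The `policy` callable receives the full observation-action history, matching the admissible-policy definition below; a memoryless policy simply ignores it.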
We consider finite but arbitrarily large $A$, $Y$ and $S$, where $A \subset \mathbb{R}^{d_1}$ and $Y \subset \mathbb{R}^{d_2}$ for some $d_1, d_2 \in \mathbb{Z}_+$ with $d := d_1 + d_2$, and $\|(y, a)\|_2 \leq 1$ for any $(y, a) \in Y \times A$. In this setting, the state process $(S_t)_{t \in \mathbb{N}}$ is not observable by the controller. Let

$$Z_t = \begin{cases} Y_0, & t = 0, \\ (Z_{t-1}, A_{t-1}, Y_t), & t > 0, \end{cases} \qquad (1)$$

be the history process, which is available to the controller at time $t \in \mathbb{N}$, and

$$\bar{Z}_t := (Z_t, A_t) = (Y_0, A_0, \ldots, Y_t, A_t), \qquad (2)$$

be the history-action process.

Definition 2.1 (Admissible policy). An admissible control policy $\pi = (\pi_t)_{t \in \mathbb{N}}$ is a sequence of measurable mappings $\pi_t : (Y \times A)^t \times Y \to \Delta(A)$, and the control at time $t$ is chosen under $\pi_t$ randomly as $\mathbb{P}(A_t = a \mid Z_t = z_t) = \pi_t(a \mid z_t)$ for any $z_t \in (Y \times A)^t \times Y$. We denote the class of all admissible policies by $\Pi_{NM}$.

If an action $a$ is taken at state $s$, then a deterministic reward $r(s, a)$ with $|r(s, a)| \leq \bar{r} < \infty$ is obtained.

Definition 2.2 (Value function, Q-function, advantage function). Let $\pi$ be an admissible policy, and $\mu \in \Delta(Y)$. The value function under $\pi$ with discount factor $\gamma \in (0, 1)$ is defined as

$$V^\pi_t(z_t) := \mathbb{E}^\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r(S_k, A_k) \,\Big|\, Z_t = z_t\Big], \qquad (3)$$

for any $z_t \in (Y \times A)^t \times Y$. Similarly, the state-action value function (also known as the Q-function) and the advantage function under $\pi$ are defined as

$$Q^\pi_t(\bar{z}_t) := \mathbb{E}^\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r(S_k, A_k) \,\Big|\, \bar{Z}_t = \bar{z}_t\Big], \qquad A^\pi_t(z_t, a) := Q^\pi_t(z_t, a) - V^\pi_t(z_t),$$

for any $\bar{z}_t \in (Y \times A)^{t+1}$, respectively. Given an initial observation distribution $\mu \in \Delta(Y)$, the optimization problem is

$$\max_{\pi \in \Pi_{NM}} V^\pi(\mu), \qquad (5)$$

where $V^\pi(\mu) := \sum_{y_0 \in Y} V^\pi_0(y_0)\,\mu(y_0)$. We denote an optimal policy as $\pi^\star \in \arg\max_{\pi \in \Pi_{NM}} V^\pi(\mu)$.

Remark 2.3 (Curse of history in RL for POMDPs). Note that the problem in equation 5 is significantly more challenging than its subcase of (fully-observable) MDPs, since there may not exist an optimal stationary policy (Krishnamurthy, 2016; Singh et al., 1994). As such, the policy search is over non-stationary randomized policies of the form $\pi = (\pi_0, \pi_1, \ldots)$
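The discounted sums defining $V^\pi_t$ and $Q^\pi_t$ can be estimated from sampled reward trajectories; a minimal sketch (names are ours), with a sanity check against the closed form $\bar r/(1-\gamma)$ for a constant reward:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """One Monte Carlo sample of the discounted return sum_k gamma^k r_k,
    i.e. a single-trajectory estimate of the expectation in equation (3)."""
    return float(sum(g * r for g, r in
                     zip(gamma ** np.arange(len(rewards)), rewards)))

# with constant reward 1 and a long horizon, the value approaches 1/(1-gamma)
v_true = 1.0 / (1.0 - 0.9)
v_trunc = discounted_return(np.ones(200), 0.9)
```

Truncating the sum at horizon $T$ incurs an error of order $\gamma^T/(1-\gamma)$, the same truncation effect that appears in the finite-time bounds later in the paper.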
where $\pi_t : (Y \times A)^t \times Y \to \Delta(A)$ depends on the history of observations $Z_t = (Y_0, A_0, Y_1, \ldots, A_{t-1}, Y_t)$ for $t \in \mathbb{N}$. In this case, direct extensions of existing reinforcement learning methods for MDPs become intractable, even for finite $Y, A$: the memory complexity of a non-stationary policy $\pi \in \Pi_{NM}$ at epoch $t \in \mathbb{N}$ is $O(|Y \times A|^{t+1})$, growing exponentially. In the following section, we formally introduce the RNN architecture that we study in this paper.

3 Independently Recurrent Neural Network Architecture

We consider an independently recurrent neural network (Ind RNN) architecture in this work (Li et al., 2018; 2019). This architecture has been featured in POPGym (Morad et al., 2023), as it enables RNNs with large sequence lengths by handling long dependencies in practical applications. In other works, it has been shown to be effective for POMDPs in practice as well (Lu et al., 2024; Elelimy et al., 2024). Let $X_t = (Y_t, A_t) \in \mathbb{R}^d$, so that $\bar{Z}_t = (X_0, X_1, \ldots, X_t)$ for any $t \in \mathbb{Z}_+$ by equation 2. The central structure in an Ind RNN is the sequence of hidden states $H_t = (H^{(1)}_t, H^{(2)}_t, \ldots, H^{(m)}_t) \in \mathbb{R}^m$ for $t = 0, 1, \ldots$, which evolves according to

$$H^{(i)}_t(\bar{Z}_t; W, U) = \varrho\big(W_{ii} H^{(i)}_{t-1}(\bar{Z}_{t-1}; W, U) + \langle U_i, X_t \rangle\big) \quad \text{for all } i \in [m], \qquad (6)$$

with the initial condition $H^{(i)}_0(\bar{Z}_0; W, U) := \varrho(\langle U_i, X_0 \rangle)$, where $\varrho : \mathbb{R} \to \mathbb{R}$ is a smooth activation function, $W = \mathrm{diag}(W_{11}, W_{22}, \ldots, W_{mm})$, and $U$ is an $m \times d$ matrix whose $i$-th row is $U_i^\top$ for $i \in [m]$. We assume a smooth activation function $\varrho$ with $|\varrho(z)| \leq \varrho_0$, $|\varrho'(z)| \leq \varrho_1$ and $|\varrho''(z)| \leq \varrho_2$ for all $z \in \mathbb{R}$, which is satisfied by many widely-used activation functions, including tanh and the sigmoid function. We consider a linear readout layer with weights $c \in \mathbb{R}^m$, which leads to the output

$$F_t(\bar{Z}_t; W, U, c) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} c_i H^{(i)}_t(\bar{Z}_t; W, U). \qquad (7)$$

The operation of an independently recurrent neural network is illustrated in Figure 1.
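A minimal sketch of the forward pass in equations (6)-(7), with tanh as the smooth activation $\varrho$; array names are ours, and the $1/\sqrt{m}$ readout scaling is the standard NTK convention assumed here:

```python
import numpy as np

def indrnn_forward(X, W, U, c, act=np.tanh):
    """Output sequence F_0, ..., F_{T-1} of an Ind RNN (diagonal recurrence).

    X: (T, d) input sequence, x_t = (y_t, a_t)
    W: (m,) diagonal hidden-to-hidden weights W_ii
    U: (m, d) input-to-hidden weights
    c: (m,) frozen linear readout
    """
    m = W.shape[0]
    H = np.zeros(m)                    # H_{-1} = 0 recovers the initial condition
    outputs = []
    for x in X:
        H = act(W * H + U @ x)         # independent (element-wise) recurrence, eq. (6)
        outputs.append(c @ H / np.sqrt(m))   # linear readout with 1/sqrt(m) scaling
    return np.array(outputs)

rng = np.random.default_rng(1)
m, d, T = 64, 3, 5
F = indrnn_forward(rng.normal(size=(T, d)),
                   rng.choice([-0.5, 0.5], size=m),
                   rng.normal(size=(m, d)),
                   rng.choice([-1.0, 1.0], size=m))
```

The diagonal weight vector `W` is exactly what distinguishes the Ind RNN from a fully-connected RNN: each hidden unit only sees its own previous value.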
Following the neural tangent kernel literature, we omit the task of training the linear output layer $c \in \mathbb{R}^m$ for simplicity, and study the training dynamics of $(W, U)$, which is the main challenge (Du et al., 2018; Oymak & Soltanolkotabi, 2020; Cai et al., 2019; Wang et al., 2019).

Figure 1: An independently recurrent neural network (Ind RNN) in the RL context.

Consequently, we denote the learnable parameters of an Ind RNN compactly in vector form as

$$\Theta = \begin{bmatrix} \Theta_1 \\ \Theta_2 \\ \vdots \\ \Theta_m \end{bmatrix} \in \mathbb{R}^{m(d+1)}, \quad \text{where } \Theta_i = \begin{bmatrix} W_{ii} \\ U_i \end{bmatrix} \in \mathbb{R}^{d+1} \text{ for } i \in [m]. \qquad (8)$$

We use $\Theta$ and $(W, U)$ interchangeably throughout the paper. A key feature of the neural tangent kernel analysis is the random initialization (Bai & Lee, 2019; Chizat et al., 2019; Cayci et al., 2023).

Definition 3.1 (Symmetric random initialization). Let $(c^0, \Theta^0) = (c^0_i, \Theta^0_i)_{i \in [m]}$ be a random vector such that

$$c^0_i \overset{iid}{\sim} \mathrm{Rad}(1), \qquad \Theta^0_i := \begin{bmatrix} W^0_{ii} \\ U^0_i \end{bmatrix} \overset{iid}{\sim} \mathrm{Rad}(\alpha) \otimes N(0, I_d), \qquad c^0_{i+m/2} = -c^0_i, \quad \Theta^0_{i+m/2} = \Theta^0_i,$$

for $i = 1, 2, \ldots, \frac{m}{2}$. We call $(c^0, \Theta^0)$ a symmetric random initialization, and denote the distribution of $(c^0, \Theta^0)$ as $\zeta_0$.

For both policy optimization (Algorithm 1) and policy evaluation (Algorithm 2), the Ind RNNs are randomly initialized according to Definition 3.1. Such random initialization schemes are widely adopted in practice, and play a fundamental role in the theoretical analysis of deep learning algorithms (Bai & Lee, 2019; Chizat et al., 2019; Wang et al., 2019; Cai et al., 2019; Liu et al., 2019). In the following subsection, we define the reference function class determined by overparameterized Ind RNNs in a detailed way, which will be instrumental in the theoretical results and their analyses. We note that this subsection can be skipped by readers who would like to focus on the algorithmic design.
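The symmetric initialization of Definition 3.1 can be sketched as follows (names are ours); the check at the end illustrates why the mirrored construction gives zero network output at initialization, since each unit in the second half exactly cancels its twin in the first half.

```python
import numpy as np

def symmetric_init(m, d, alpha, rng):
    """Symmetric random initialization (Definition 3.1): the second half of
    the units copies the first half's weights with negated readout."""
    assert m % 2 == 0
    half = m // 2
    c = rng.choice([-1.0, 1.0], size=half)        # c_i ~ Rad(1)
    W = rng.choice([-alpha, alpha], size=half)    # W_ii ~ Rad(alpha)
    U = rng.normal(size=(half, d))                # U_i ~ N(0, I_d)
    return (np.concatenate([c, -c]),              # c_{i+m/2} = -c_i
            np.concatenate([W, W]),               # Theta_{i+m/2} = Theta_i
            np.vstack([U, U]))

c0, W0, U0 = symmetric_init(8, 3, alpha=0.5, rng=np.random.default_rng(2))
# any input produces (near-machine-precision) zero output at initialization
x = np.ones(3)
H = np.tanh(U0 @ x)
out = c0 @ H / np.sqrt(8)
```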
3.1 Reference Function Class for Independently Recurrent Neural Networks

A fundamental question in reinforcement learning with function approximation is to determine a concrete reference function class for the function approximation architecture that is used in the value and policy spaces (Bertsekas & Tsitsiklis, 1996). In this subsection, we identify and discuss the reference function class defined by the Ind RNN architecture that will be used for incorporating memory to solve POMDPs. To motivate the discussion, we first review basic reference function classes for (fully-observable) MDPs, then extend the discussion to POMDPs.

Function approximation in MDPs. Let us consider value-based reinforcement learning in the case of MDPs, where the objective is to learn the Q-function under a given stationary policy $\pi$. The approximation error for a given reference class $F$ of functions $f : S \times A \to \mathbb{R}$ is

$$\epsilon_{app}(F) := \inf_{f \in F} \mathbb{E}_{s,a}\big[(Q^\pi(s, a) - f(s, a))^2\big]. \qquad (9)$$

For example, if a linear function approximation scheme with a given feature map $\varphi : S \times A \to \mathbb{R}^p$ is used, then the reference function class is $F := \{(s, a) \mapsto \theta^\top \varphi(s, a) : \theta \in \mathbb{R}^p\} = \mathrm{span}(\Phi)$, where $\Phi := [\varphi^\top(s, a)]_{s,a}$ is the feature matrix. In the case of linear MDPs (Jin et al., 2020), we have $Q^\pi \in F$ and $\epsilon_{app}(F) = 0$; otherwise, TD(0) with this linear approximation scheme has an inevitable approximation error $\frac{1}{1-\gamma}\epsilon_{app}(F)$ (Bertsekas & Tsitsiklis, 1996). The reference function class for a randomly-initialized single-hidden-layer feedforward neural network with frozen output layer is

$$F_{NTK} := \Big\{(s, a) \mapsto \mathbb{E}_{u_0 \sim N(0, I_d)}\big[v(u_0)^\top \nabla_u \varrho(\langle (s, a), u_0 \rangle)\big] \ \text{ such that } \ \mathbb{E}_{u_0 \sim N(0, I_d)}\big[\|v(u_0)\|_2^2\big] < \infty \Big\}, \qquad (10)$$

where $v : \mathbb{R}^d \to \mathbb{R}^d$ (Liu et al., 2019; Wang et al., 2019; Cayci et al., 2023).
Technically, the completion of $F_{NTK}$ yields the reproducing kernel Hilbert space (RKHS) of the so-called neural tangent kernel

$$\kappa(x, x') := \mathbb{E}_{u_0}\big[\nabla_u \varrho(u_0^\top x)^\top \nabla_u \varrho(u_0^\top x')\big] = x^\top x'\, \mathbb{E}\big[\varrho'(u_0^\top x)\, \varrho'(u_0^\top x')\big]$$

for any $x, x' \in S \times A$, and its explicit analysis shows that it is provably rich (Ji et al., 2019). For a detailed discussion of the function space $F_{NTK}$ and its role in reinforcement learning, we refer to Section A.2 in Liu et al. (2019) and Cayci et al. (2024b). Due to the concrete approximation bounds for $F_{NTK}$, the representational assumption $Q^\pi \in F_{NTK}$ is standard in the theoretical analyses of neural TD learning for MDPs, and the objective is to prove that neural TD learning can learn any $Q^\pi \in F_{NTK}$ from samples with finite-time and finite-sample guarantees (Cai et al., 2019; Wang et al., 2019; Cayci et al., 2023). Without the representational assumption $Q^\pi \in F_{NTK}$, the optimality guarantees in Cai et al. (2019); Liu et al. (2019); Wang et al. (2019); Cayci et al. (2023) hold up to an additional error term proportional to $\frac{1}{1-\gamma}\epsilon_{app}(F_{NTK})$.

Function approximation in RL for POMDPs. Analogous to the approximation error analysis in RL for MDPs discussed above, our objective here is to identify a suitable reference function class for the Ind RNN architecture defined in equation 7. Building on the framework of Cayci & Eryilmaz (2025), we present an infinite-width characterization of Ind RNNs in the neural tangent kernel (NTK) regime. This directly extends the reference function class $F_{NTK}$ in equation 10 for feedforward neural networks in the neural RL literature (Cai et al., 2019; Wang et al., 2019; Cayci et al., 2024b) to the partially observable setting with recurrent models. We note that our reference function class reduces to that of feedforward neural networks as a special case (see Remark 3.4). For any $t \in \mathbb{Z}_+$ and input $\bar{Z}$, symmetric initialization ensures that $F_t(\bar{Z}; \Theta^0) = 0$.
Furthermore, the first-order Taylor expansion of $F_t$ at $\Theta \in \mathbb{R}^{m(d+1)}$ around $\Theta^0$ yields

$$F_t(\bar{Z}; \Theta) = \nabla_\Theta F_t(\bar{Z}; \Theta^0)^\top (\Theta - \Theta^0) + O\big(\|\Theta - \Theta^0\|^2\big).$$

As $m \to \infty$, the linear part $\nabla_\Theta F_t(\bar{Z}; \Theta^0)^\top (\Theta - \Theta^0)$ is able to approximate a rich class of functions determined by the reproducing kernel Hilbert space (RKHS) of the recurrent neural tangent kernel, defined as

$$\kappa_t(\bar{Z}, \bar{Z}') := \lim_{m \to \infty} \nabla_\Theta F_t(\bar{Z}; \Theta^0)^\top \nabla_\Theta F_t(\bar{Z}'; \Theta^0), \quad \text{for } t \in \mathbb{Z}_+.$$

In the following, we characterize this sequence of reproducing kernel Hilbert spaces for $t \in \mathbb{Z}_+$ explicitly, following Cayci & Eryilmaz (2025). Let $w_0 \sim \mathrm{Rad}(\alpha)$ and $u_0 \sim N(0, I_d)$ be independent random variables, and $\theta_0 := (w_0, u_0)$. Given a sequence $\bar{z} = (x_0, x_1, \ldots) \in (Y \times A)^{\mathbb{Z}_+}$, let

$$h_t(\bar{z}_t; \theta_0) := \varrho\big(w_0 h_{t-1}(\bar{z}_{t-1}; \theta_0) + \langle u_0, x_t \rangle\big), \quad t = 0, 1, 2, \ldots,$$

with the initial condition $h_{-1} := 0$, and

$$I_t(\bar{z}_t; \theta_0) := \varrho'\big(w_0 h_{t-1}(\bar{z}_{t-1}; \theta_0) + \langle u_0, x_t \rangle\big).$$

Then, the neural tangent random feature mapping¹ at time $t$ is defined as

$$\psi_t(\bar{z}_t; \theta_0) := \sum_{k=0}^{t} w_0^k \begin{bmatrix} h_{t-k-1}(\bar{z}_{t-k-1}; \theta_0) \\ x_{t-k} \end{bmatrix} \prod_{j=0}^{k} I_{t-j}(\bar{z}_{t-j}; \theta_0).$$

Based on the sequence of neural tangent random features, the neural tangent random feature matrix is defined as $\Psi(\bar{z}; \theta_0) = \Psi_\infty(\bar{z}; \theta_0)$, where

$$\Psi_T(\bar{z}; \theta_0) := \begin{bmatrix} \psi_0^\top(\bar{z}_0; \theta_0) \\ \psi_1^\top(\bar{z}_1; \theta_0) \\ \vdots \\ \psi_{T-1}^\top(\bar{z}_{T-1}; \theta_0) \end{bmatrix}$$

for any $T \in \mathbb{Z}_+$.

Definition 3.2 (Transportation mapping). Let $H$ be the set of mappings $v : \mathbb{R}^{1+d} \to \mathbb{R}^{1+d}$ such that $v(\theta_0) := \begin{bmatrix} v_w(\theta_0) \\ v_u(\theta_0) \end{bmatrix}$ for $\theta_0 = (w_0, u_0)$ with $\mathbb{E}[\|v(\theta_0)\|_2^2] < \infty$, where $w_0 \sim \mathrm{Rad}(\alpha)$ and $u_0 \sim N(0, I_d)$. We call $v \in H$ a transportation mapping, following Ji & Telgarsky (2019); Ji et al. (2019).

Definition 3.3 (Reference function class for Ind RNNs). We define the reference function class of Ind RNNs for any sequence-length $T \geq 1$ as

$$F_T := \left\{ \bar{z} \mapsto \mathbb{E}\big[\Psi_T(\bar{z}; \theta_0)\, v(\theta_0)\big] = \begin{bmatrix} f_0(\bar{z}_0; v) \\ \vdots \\ f_{T-1}(\bar{z}_{T-1}; v) \end{bmatrix} : v \in H \right\}, \qquad \bar{z} \in (Y \times A)^{\mathbb{Z}_+},$$

where $f_t(\bar{z}_t; v) := \mathbb{E}[\psi_t^\top(\bar{z}_t; \theta_0)\, v(\theta_0)]$ for any $\bar{z} \in (Y \times A)^{\mathbb{Z}_+}$. The same transportation mapping $v$ is used to define $f_t$ for all $t \in \mathbb{N}$, which is a characteristic feature of weight-sharing in RNNs. We denote $F := F_\infty$.

Remark 3.4 (Reduction to $F_{NTK}$).
Note that setting $T = 1$ yields the random feature map

$$\psi_0(\bar{z}_0; \theta_0) = \begin{bmatrix} 0 \\ \nabla_u \varrho(\langle u_0, x_0 \rangle) \end{bmatrix},$$

since $\nabla_u \varrho(\langle x_0, u_0 \rangle) = x_0\, \varrho'(\langle x_0, u_0 \rangle)$. Hence, for any $v \in H$, we have

$$F_1 = \big\{x_0 \mapsto \mathbb{E}[v_u(u_0)^\top \nabla_u \varrho(\langle x_0, u_0 \rangle)] : \mathbb{E}\,\|v_u(u_0)\|_2^2 < \infty\big\},$$

which is exactly the reference function class $F_{NTK}$ for feedforward neural networks given in equation 10. In other words, $\{F_T : T \in \mathbb{Z}_+\}$ contains $F_{NTK}$ with $F_1 = F_{NTK}$, which is the reference function class in the neural RL literature for MDPs (Wang et al., 2019; Liu et al., 2019). $F_1$ is dense in the space of continuous functions on a compact set (Ji et al., 2019).

Remark 3.5 (Fully-connected RNNs). Ind RNNs utilize a diagonal hidden-to-hidden weight matrix $W$, which was shown to be very effective in handling long-term dependencies in RL compared to conventional RNNs, GRU and LSTM architectures (Morad et al., 2023). In addition to their practical benefits, Ind RNNs have theoretical niceties as well, as they enable (i) explicit characterization of the reference function class, and (ii) direct control and analysis of the spectral radius of $W$. Both of these theoretical amenities are lost when $W$ does not have a diagonal structure.

3.2 Max-Norm Projection for Ind RNNs

Given an initialization $(W(0), U(0), c)$ as in Definition 3.1 and a vector $\rho = (\rho_w, \rho_u) \in \mathbb{R}^2_{>0}$ of projection radii, we define the compactly-supported set of weights $\Omega_{\rho,m} \subset \mathbb{R}^{m(d+1)}$ as

$$\Omega_{\rho,m} = \Big\{\Theta \in \mathbb{R}^{m(d+1)} : \max_i |W_{ii} - W_{ii}(0)| \leq \frac{\rho_w}{\sqrt{m}}, \ \max_i \|U_i - U_i(0)\|_2 \leq \frac{\rho_u}{\sqrt{m}}\Big\}.$$

¹The feature uses a weighted sum of all past inputs $x_k$, $k \leq t$, leading to a discounted memory to tackle non-stationarity. $x_{t-k}$ is scaled with $w_0^k$, $w_0 \sim \mathrm{Rad}(\alpha)$, thus it yields a fading-memory approximation of the history if $\alpha < 1$.

Given any symmetric random initialization $(W(0), U(0), c)$ and $\rho \in \mathbb{R}^2_{>0}$, the set $\Omega_{\rho,m}$ is a compact and convex subset of $\mathbb{R}^{m(d+1)}$, and any $\Theta \in \Omega_{\rho,m}$ satisfies $\max_{1 \leq i \leq m} |W_{ii} - W_{ii}(0)| \leq \frac{\rho_w}{\sqrt{m}}$ and $\max_{1 \leq i \leq m} \|U_i - U_i(0)\|_2 \leq \frac{\rho_u}{\sqrt{m}}$.
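The row-wise max-norm projection onto $\Omega_{\rho,m}$ amounts to a scalar clip for each $W_{ii}$ and a Euclidean-ball projection for each row $U_i$, both centered at the initialization; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def maxnorm_project(W, U, W0, U0, rho_w, rho_u):
    """Max-norm projection onto Omega_{rho,m}: each W_ii is clipped to within
    rho_w/sqrt(m) of its initialization; each row U_i is projected onto an
    L2 ball of radius rho_u/sqrt(m) around its initialization."""
    m = W.shape[0]
    rw, ru = rho_w / np.sqrt(m), rho_u / np.sqrt(m)
    W_proj = np.clip(W, W0 - rw, W0 + rw)                  # scalar clip per unit
    diff = U - U0
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    scale = np.minimum(1.0, ru / np.maximum(norms, 1e-12)) # shrink rows outside ball
    return W_proj, U0 + diff * scale

rng = np.random.default_rng(4)
m, d = 16, 3
W0, U0 = rng.choice([-0.5, 0.5], size=m), rng.normal(size=(m, d))
W, U = W0 + rng.normal(size=m), U0 + rng.normal(size=(m, d))
Wp, Up = maxnorm_project(W, U, W0, U0, rho_w=1.0, rho_u=1.0)
```

Because the constraint set is a product of per-unit balls, the projection decomposes unit by unit, which is what makes it cheap even for large $m$.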
The projection of $\Theta$ onto $\Omega_{\rho,m}$ operates row-wise:

$$\mathrm{Proj}_{\Omega_{\rho,m}}[\Theta] = \left(\arg\min_{w_i \in B_2(W_{ii}(0),\, \rho_w/\sqrt{m})} |W_{ii} - w_i|,\ \ \arg\min_{u_i \in B_2(U_i(0),\, \rho_u/\sqrt{m})} \|U_i - u_i\|_2\right)_{i \in [m]}.$$

As such, the projection operator $\mathrm{Proj}_{\Omega_{\rho,m}}[\cdot]$ onto $\Omega_{\rho,m}$ is called the max-norm projection (or regularization) (Goodfellow et al., 2013; Srebro et al., 2004). As an immediate consequence, $\Theta \in \Omega_{\rho,m}$ implies that

$$|W_{ii}| \leq |W_{ii} - W_{ii}(0)| + |W_{ii}(0)| \leq \alpha + \frac{\rho_w}{\sqrt{m}} =: \alpha_m,$$

which implies strict control over $\max_{i \in [m]} |W_{ii}|$. As we will see in Section 5 and Section 6, such strict control over the norm of the hidden-to-hidden weights $W_{ii}$ is of significant importance in stabilizing the training of Ind RNNs. Similar projection mechanisms for Ind RNNs are adopted in practice as well (Morad et al., 2023). For further details, we refer to Appendix A.

4 Rec-NAC: A High-Level Algorithmic View

In this section, we present a high-level description of our Recurrent Natural Actor-Critic (Rec-NAC) algorithm with two inner loops, the critic (called Rec-TD) and the actor (called Rec-NPG), for policy optimization with RNNs. The details of the inner loops will be given in the succeeding sections. We use an admissible policy $\pi = (\pi_t)_{t \in \mathbb{N}}$ that is parameterized by a recurrent neural network $(F_t(\cdot; \Phi))_{t \in \mathbb{N}}$ of the form given in equation 7 with network width $m \in \mathbb{Z}_+$. To that end, for any $t \in \mathbb{N}$, let

$$\pi^\Phi_t(a \mid z_t) := \frac{\exp(F_t((z_t, a); \Phi))}{\sum_{a' \in A} \exp(F_t((z_t, a'); \Phi))}, \qquad (15)$$

for any $z_t \in (Y \times A)^t \times Y$ and $a \in A$, with parameter $\Phi \in \mathbb{R}^{m(d+1)}$. The high-level operation of Rec-NAC is summarized in Algorithm 1.

Algorithm 1 Recurrent Natural Actor-Critic (Rec-NAC): a high-level description
1: Initialize the actor RNN as $(c, \Phi(0)) \sim \zeta_0$ (see Definition 3.1).
2: for $n = 0, 1, 2, \ldots, N-1$ do
3:   Critic. Independently initialize the weights of the critic Ind RNN as $(c_n, \Theta_n(0)) \overset{iid}{\sim} \zeta_0$.
4:   Run Rec-TD in Algorithm 2 for $K_{td}$ iterations, and obtain $\bar{\Theta}_n := K_{td}^{-1} \sum_{k < K_{td}} \Theta_n(k)$.

Algorithm 2 Rec-TD for policy evaluation
1: Input: step-size $\eta > 0$, max-norm projection radius $\rho = (\rho_w, \rho_u)$, sequence-length $T$.
2: Initialize $(c, \Theta(0)) \sim \zeta_0$ according to Definition 3.1.
3: for $k = 0, 1, 2, \ldots, K-1$ do
4:   Sample an initial state $S^k_0 \sim \mu$ independently.
5:   Observe $Y^k_0 \sim \phi(S^k_0, \cdot)$.
6:   Choose an action $A^k_0 \sim \pi_0(\cdot \mid Z^k_0)$.
7:   Set $\check{R}^k_T := 0$.
8:   for $t = 0, 1, \ldots, T$ do
9:     State transition $S^k_{t+1} \sim P((S^k_t, A^k_t), \cdot)$.
10:    Observe $Y^k_{t+1} \sim \phi(S^k_{t+1}, \cdot)$.
11:    Choose an action $A^k_{t+1} \sim \pi_{t+1}(\cdot \mid Z^k_{t+1})$.
12:    Compute the temporal difference $\delta_t(\bar{Z}^k_{t+1}; \Theta(k))$, where $\delta_t(\bar{z}_{t+1}; \Theta) := r_t + \gamma F_{t+1}(\bar{z}_{t+1}; \Theta) - F_t(\bar{z}_t; \Theta)$.
13:    Update the stochastic semi-gradient: $\check{R}^k_T \leftarrow \check{R}^k_T + \gamma^t\, \delta_t(\bar{Z}^k_{t+1}; \Theta(k))\, \nabla_\Theta F_t(\bar{Z}^k_t; \Theta(k))$.
14:  end for
15:  Parameter update with max-norm projection: $\Theta(k+1) = \mathrm{Proj}_{\Omega_{\rho,m}}\big[\Theta(k) + \eta\, \check{R}^k_T\big]$.
16: end for

Assumption 5.3 is a representational assumption, stating that $(Q^\pi_t)_t$ lies in the RKHS induced by the random features $\Psi_T(\bar{z}; \theta_0)$ defined in equation 12. It directly extends Assumption 4.1 in Wang et al. (2019) and Assumption 2 in Cayci et al. (2024b) to POMDPs, and exactly recovers these assumptions when $T = 1$ (see Remark 3.4).

Theorem 5.4 (Finite-time bounds for Rec-TD). Under Assumptions 5.1-5.3, for any projection radius $\rho \geq \nu = (\nu_w, \nu_u)$ and step-size $\eta > 0$, Rec-TD with max-norm projection achieves the following error bound:

$$\frac{1}{K}\,\mathbb{E}\Big[\sum_{k=0}^{K-1} R^\pi_T(\Theta(k))\Big] \lesssim \frac{\|\nu\|_2^2}{K\eta(1-\gamma)} + \frac{\eta\, C^{(1)}_T}{(1-\gamma)^3} + \frac{C^{(2)}_T}{(1-\gamma)^2 \sqrt{m}} + \frac{\gamma^T}{K}\sum_{k=0}^{K-1} \omega^2_{T,k} \qquad (17)$$

for any $K \in \mathbb{N}$, where

$$C^{(1)}_T,\, C^{(2)}_T = \mathrm{poly}\big(p_T((\alpha + \rho_w m^{-1/2})\varrho_1),\, \|\rho\|_2,\, \|\nu\|_2\big)$$

are instance-dependent constants that do not depend on $K$, and $\omega_{t,k} := \sqrt{\mathbb{E}\big[(F_t(\bar{Z}_t; \Theta(k)) - Q^\pi_t(\bar{Z}^k_t))^2\big]}$ is a uniformly bounded sequence for $t, k \in \mathbb{N}$. Furthermore, the loss at the average-iterate, $\mathbb{E}\big[R^\pi_T\big(\frac{1}{K}\sum_{k=0}^{K-1}\Theta(k)\big)\big]$, admits the same upper bound as the regret bound in equation 17, up to a multiplicative factor of 10.

The proof of Theorem 5.4 can be found in Section B. Assumption 5.1 is critical to obtain the finite-time bounds in Theorem 5.4, and holds when the system can be restarted independently from the initial state distribution (Bhandari et al., 2018).
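The inner loop of Algorithm 2 can be sketched as follows. To keep the example self-contained, we use a linear critic $F_t(\bar z_t; \theta) = \langle \varphi_t, \theta\rangle$ in place of the Ind RNN and a single Euclidean ball in place of $\Omega_{\rho,m}$, so this illustrates only the semi-gradient accumulation and projected update, not the full method; all names and data are ours.

```python
import numpy as np

def rec_td_episode(features, rewards, theta, gamma, eta, radius):
    """One episode of the Rec-TD inner loop (Algorithm 2, steps 7-15) with a
    linear critic F_t = <features[t], theta> standing in for the Ind RNN.
    Accumulates the discounted semi-gradient, then takes a projected step."""
    R = np.zeros_like(theta)
    for t in range(len(rewards)):
        F_t = features[t] @ theta
        F_next = features[t + 1] @ theta if t + 1 < len(features) else 0.0
        delta = rewards[t] + gamma * F_next - F_t      # TD error delta_t
        R += gamma ** t * delta * features[t]          # gamma^t delta_t grad F_t
    theta_new = theta + eta * R                        # ascent step
    nrm = np.linalg.norm(theta_new)
    if nrm > radius:                                   # crude projection stand-in
        theta_new *= radius / nrm
    return theta_new

rng = np.random.default_rng(5)
phi = rng.normal(size=(6, 4))                          # features for a 6-step path
theta = np.zeros(4)
for _ in range(200):                                   # K = 200 restarted episodes
    theta = rec_td_episode(phi, np.ones(5), theta, gamma=0.9, eta=0.05, radius=10.0)
```

The projection keeps the iterates bounded regardless of the TD error magnitudes, mirroring the stabilizing role of the max-norm projection in the analysis.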
In the specific case of fully-observable MDPs, the process $\{(S_k, A_k) : k \in \mathbb{N}\}$ is a Markov chain under any stationary policy, and mixing-time arguments under uniform ergodicity assumptions are used for analysis under Markovian sampling from a single trajectory without independent restarts (Bhandari et al., 2018; Cayci et al., 2023). On the other hand, in the case of POMDPs, $\{(S_k, A_k) : k \in \mathbb{N}\}$ is not a Markov chain under a general non-stationary policy $\pi$. In the specific case of policies parameterized by RNNs with hidden state $\{H_k : k \in \mathbb{N}\}$, the augmented process $\{(S_k, A_k, Y_k, H_k) : k \in \mathbb{N}\}$ forms a Markov process. The challenge here is that the state space of this augmented Markov process may be very large or even continuous, and standard theoretical tools (e.g., mixing-time arguments) become much more involved.

Under Assumption 5.3, Theorem 5.4 implies the global $\epsilon$-optimality of Rec-TD as the sequence-length $T \to \infty$, for a sufficiently large number of iterations $K = O(C^{(1)}_T/\epsilon^2)$ and network width $m = O(C^{(2)}_T/\epsilon^2)$. If we omit Assumption 5.3, the error bound in Theorem 5.4 still holds with an additional error term $O\big(\frac{1}{1-\gamma}\epsilon_{app}(F_T)\big)$, where

$$\epsilon_{app}(F_T) := \inf_{f \in F_T} \mathbb{E}^\pi_\mu\Big[\sum_{t=0}^{T-1} \gamma^t \big(f_t(\bar{Z}_t) - Q^\pi_t(\bar{Z}_t)\big)^2\Big]$$

is the function approximation error.

Remark 5.5 (Overcoming perceptual aliasing with Rec-TD). Memoryless TD learning suffers from a non-vanishing optimality gap in POMDPs, known as perceptual aliasing (Singh et al., 1994). To address this, Rec-TD integrates $T$-step stochastic approximation with an RNN, enabling it to retain memory. Accordingly, Theorem 5.4 establishes that as $T \to \infty$, Rec-TD reduces $R^\pi_T$ to arbitrarily small values, given sufficiently large network width $m$ and iteration count $K$.

Remark 5.6 (The impact of long-term dependencies). Note that both constants $C^{(1)}_T, C^{(2)}_T$ depend polynomially on $p_T(\varrho_1 \alpha_m)$. As noted in Goodfellow et al.
(2016), the spectral radius of $\{W(k) : k \in \mathbb{N}\}$ determines the degree of long-term dependencies in the problem, as it scales $H_t$. Consistent with this observation, our bounds depend on

$$\alpha_m := \alpha + \frac{\rho_w}{\sqrt{m}} \geq \sqrt{\lambda_{\max}(W^\top(k) W(k))} = \max_{i \in [m]} |W_{ii}(k)|,$$

for any $k \in \mathbb{N}$. Note that Theorem 5.4 requires $\rho_w \geq \nu_w$, thus $\max_{i \in [m]} |W_{ii}(k)|$ may need to be sufficiently large depending on the RKHS norm $\nu$. Let $\varepsilon > 0$ be any given target error.

Short-term memory. If $\alpha_m < \frac{1}{\varrho_1}$, then it is easy to see that $p_T(\varrho_1 \alpha_m) \leq \frac{1}{1 - \varrho_1 \alpha_m}$. Thus, the extra term involving $\gamma^T$ in equation 17 vanishes at a geometric rate as $T \to \infty$, while $m$ (network width) and $K$ (iteration complexity) remain $O(1/\varepsilon^2)$. Rec-TD is very efficient in this case.

Long-term memory. If $\alpha_m > \frac{1}{\varrho_1}$, then as $T \to \infty$, both $m$ and $K$ grow at a rate $O\big((\varrho_1 \alpha_m)^T / \varepsilon^2\big)$, while the extra term involving $\gamma^T$ in equation 17 vanishes at a geometric rate. As such, the required network size and iteration count grow geometrically in $T$ in systems with long-term memory, constituting the pathological case.

Theorem 5.4 emphasizes the critical importance of max-norm projection and a large neural network size $m$ in stabilizing the training of Ind RNNs by Rec-TD, and guides the choice of the projection radius $\rho$. Interestingly, if $\{Q^\pi_t : t < T\} \in F_T$ has an RKHS norm $\nu_w \leq 1/\varrho_1$, then Rec-TD with a projection radius $\rho_w \geq \nu_w$ and overparameterization $m \gg 1$ yields significantly improved policy evaluation performance in terms of $C^{(1)}_T, C^{(2)}_T$ for large $T$. Similar projection mechanisms on $\{W_{ii} : i \in [m]\}$ are widely used for Ind RNNs in practice, for instance in Morad et al. (2023), to enhance stability. The performance of Rec-TD is studied numerically on Random-POMDP instances in Section C.
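The two regimes of Remark 5.6 can be illustrated numerically. Here we read $p_T(x)$ as the geometric sum $\sum_{t < T} x^t$, which is our interpretation of how the constant enters the bound, not a definition taken from the text:

```python
# Geometric sum p_T(x) = sum_{t<T} x^t: bounded for x < 1 (short-term memory),
# geometrically growing for x > 1 (long-term memory, the pathological case).
def p(T, x):
    return sum(x ** t for t in range(T))

short = [p(T, 0.8) for T in (5, 20, 80)]   # rho1 * alpha_m < 1: saturates at 1/(1-x) = 5
long_ = [p(T, 1.2) for T in (5, 20, 80)]   # rho1 * alpha_m > 1: blows up with T
```

This is the exploding-semi-gradients dichotomy in miniature: the same formula is benign or catastrophic depending on whether $\varrho_1 \alpha_m$ sits below or above 1.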
6 Actor: Recurrent Natural Policy Gradient (Rec-NPG) for POMDPs

The goal is to solve the following problem for a given initial distribution $\mu \in \Delta(Y)$ and $\rho \in \mathbb{R}^2_{>0}$:

$$\max_{\Phi \in \mathbb{R}^{m(d+1)}} V^{\pi_\Phi}(\mu) \quad \text{such that} \quad \Phi \in \Omega_{\rho,m}. \qquad \text{(PO)}$$

6.1 Recurrent Natural Policy Gradient for POMDPs

In this section, we describe the recurrent natural policy gradient (Rec-NPG) algorithm for non-stationary reinforcement learning. First, we formally establish in Prop. D.2 that the policy gradient under partial observability takes the form

$$\nabla_\Phi V^{\pi_\Phi}(\mu) = \mathbb{E}^{\pi_\Phi}_\mu\Big[\sum_{t=0}^{\infty} \gamma^t\, Q^{\pi_\Phi}_t(Z_t, A_t)\, \nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)\Big],$$

where the state $S_t$ in the MDP framework is replaced by the process history $Z_t$ in the POMDP. The Fisher information matrix under a policy $\pi_\Phi$ is defined as

$$G_\mu(\Phi) := \mathbb{E}^{\pi_\Phi}_\mu\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)\, \nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)^\top\Big]$$

for an initial observation distribution $\mu \in \Delta(Y)$. Rec-NPG updates the policy parameters by

$$\Phi(n+1) = \Phi(n) + \eta\, G^\dagger_\mu(\Phi(n))\, \nabla_\Phi V^{\pi_{\Phi(n)}}(\mu), \qquad (18)$$

for an initial parameter $\Phi(0)$ and step-size $\eta > 0$, where $G^\dagger$ denotes the Moore-Penrose inverse of a matrix $G$. This update rule is in the same spirit as the NPG introduced in Kakade (2001); however, due to the non-stationary nature of the partially observable MDP, it has significant complications that we will address. In order to avoid computationally expensive policy updates in equation 18, we utilize the following extension of the compatible function approximation in Kakade (2001) to the case of non-stationary policies for POMDPs.

Proposition 6.1 (Compatible function approximation for non-stationary policies). For any $\Phi \in \mathbb{R}^{m(d+1)}$ and initial observation distribution $\mu$, let

$$L_\mu(\omega; \Phi) = \mathbb{E}^{\pi_\Phi}_\mu\Big[\sum_{t=0}^{\infty} \gamma^t \big(\nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)^\top \omega - A^{\pi_\Phi}_t(\bar{Z}_t)\big)^2\Big] \qquad (19)$$

for $\omega \in \mathbb{R}^{m(d+1)}$. Then, we have

$$G^\dagger_\mu(\Phi)\, \nabla_\Phi V^{\pi_\Phi}(\mu) \in \arg\min_{\omega \in \mathbb{R}^{m(d+1)}} L_\mu(\omega; \Phi). \qquad (20)$$

We have the following remark regarding the intricacies of compatible function approximation in the POMDP setting.

Remark 6.2 (Path-based compatible function approximation with truncation).
For MDPs, the compatible function approximation error $L_\mu(\omega; \Phi)$ can be expressed using the discounted state-action occupancy measure, from which one can obtain unbiased samples (Agarwal et al., 2020; Konda & Tsitsiklis, 2003). Thus, the infinite horizon can be handled without any loss. On the other hand, for POMDPs, as in equation 19, this simplification is impossible due to the non-stationarity. As such, we use a path-based method for a sequence-length $T \in \mathbb{N}$ with

$$\ell_T(\omega; \Phi, Q) := \sum_{t=0}^{T-1} \gamma^t \big(\nabla_\Phi \ln \pi^\Phi_t(A_t \mid Z_t)^\top \omega - A_t(Z_t, A_t)\big)^2,$$

where $A_t(z_t, a_t) = Q_t(z_t, a_t) - \sum_{a \in A} \pi^\Phi_t(a \mid z_t)\, Q_t(z_t, a)$ is the advantage function. Given a policy with parameter $\Phi(n)$, the corresponding output of the critic, obtained by Rec-TD with the average-iterate, is $\hat{Q}^{(n)}_t(\cdot) := F_t(\cdot; \bar{\Theta}_n)$ for $\bar{\Theta}_n := \frac{1}{K_{td}} \sum_{k < K_{td}} \Theta_n(k)$.

If $\alpha_m > \varrho_1^{-1}$, then $m$ and $N$ should grow at a rate $(\alpha_m \varrho_1)^T$, implying the curse of dimensionality (more generally, this is known as the exploding gradient problem, Goodfellow et al. (2016)). On the other hand, if $\alpha_m < \varrho_1^{-1}$, then $L_t, \beta_t, \Lambda_t, \chi_t$ are all $O(1)$ for all $t$, implying efficient learning of POMDPs. This establishes a very interesting connection between the memory in the system, the continuity and smoothness of the RNN with respect to its parameters, and the optimality gap under Rec-NPG.

The term $\frac{2\gamma^T \bar{r}}{(1-\gamma)^2}$ is due to truncating the trajectory at $T$, and vanishes with large $T$. Rec-NPG achieves $\epsilon$-optimality (up to the compatible function approximation and truncation errors) with $N = O(1/\epsilon^2)$ steps and $m = O(1/\epsilon^4)$ neural network width for any $\epsilon > 0$.

Remark 6.5. The quantity $\kappa$ in Proposition 6.8 is the so-called concentrability coefficient in policy gradient methods (Agarwal et al., 2020; Bhandari & Russo, 2019; Wang et al., 2019), and determines the complexity of exploration. Note that it is defined in terms of path probabilities $P^{\pi,\mu}_T$ in the non-stationary setting.
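Minimizing the truncated loss $\ell_T$ over $\omega$ is a $\gamma^t$-weighted least-squares problem in the score features $\nabla_\Phi \ln \pi^\Phi_t$; a minimal sketch with synthetic scores and realizable advantages (all data here is fabricated for illustration, and `lstsq` stands in for the sampled minimization):

```python
import numpy as np

def npg_direction(scores, advantages, weights):
    """Natural-gradient direction via compatible function approximation:
    minimize sum_t weights[t] * (<scores[t], w> - advantages[t])^2 by
    weighted least squares, avoiding explicit Fisher matrix inversion."""
    sw = np.sqrt(weights)[:, None]
    w, *_ = np.linalg.lstsq(scores * sw,
                            advantages * np.sqrt(weights), rcond=None)
    return w

rng = np.random.default_rng(6)
T, p = 50, 4
scores = rng.normal(size=(T, p))        # stand-ins for grad log pi_t
w_true = np.array([1.0, -2.0, 0.5, 0.0])
A = scores @ w_true                     # realizable advantages for the check
gammas = 0.99 ** np.arange(T)           # gamma^t weights
w_hat = npg_direction(scores, A, gammas)
```

When the advantages are exactly realizable by the score features, the least-squares solution recovers the natural-gradient direction exactly, which mirrors the exact case of Proposition 6.1.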
By assuming $\kappa < \infty$, we assume that the policies $\pi_{\Phi(n)}$ perform sufficient exploration to visit each trajectory visited by $\pi^\star$ with positive probability. To establish similar bounds without this assumption, entropic regularization is widely used to encourage exploration in practical scenarios (Ahmed et al., 2019; Cen et al., 2020; Cayci et al., 2024c). The benefits of entropic regularization in policy optimization for POMDPs to encourage exploration are an interesting future research direction.

In the following, we decompose the compatible function approximation error $\varepsilon^T_{cfa}$ into the approximation error for the RNN and the statistical errors. To that end, let $\varepsilon_{app,n} = \inf_{\omega \in \Omega_{\rho,m}} \mathbb{E}\big[\sum_{t \geq 0} \cdots\big]$ denote the approximation error for the RNN. For some index set $J$ and $\nu \in \mathbb{R}^2_{>0}$, we consider a class $H_{J,\nu}$ of transportation mappings

$$\Big\{v^{(j)} \in H : j \in J\Big\}, \qquad \sup_{\theta \in \mathbb{R}^{d+1},\, j \in J} |v^{(j)}_w(\theta)| \leq \nu_w, \qquad \sup_{\theta \in \mathbb{R}^{d+1},\, j \in J} \|v^{(j)}_u(\theta)\|_2 \leq \nu_u,$$

and also the corresponding infinite-width limit $F_{J,\nu} := \{\bar{z} \mapsto \mathbb{E}[\Psi(\bar{z}; \theta_0)\, v(\theta_0)] : v \in \mathrm{Conv}(H_{J,\nu})\}$, where $\Psi(\cdot; \theta_0)$ is the NTRF matrix defined in equation 12. We assume that there exist an index set $J$ and $\nu \in \mathbb{R}^2_{>0}$ such that $Q^{\pi_{\Phi(n)}} \in F_{J,\nu}$ for all $n \in \mathbb{N}$. This representational assumption implies that the Q-functions under all iterate policies $\pi_{\Phi(n)}$ throughout the Rec-NPG iterations $n = 0, 1, \ldots$ can be represented by convex combinations of a fixed set of mappings in the NTK function class $F$ indexed by $J$. As we will see, the richness of $J$, as measured by a relevant Rademacher complexity, will play an important role in bounding the approximation error. To that end, for $\bar{z}_t = (z_t, a_t) \in (Y \times A)^{t+1}$, let

$$G^{\bar{z}_t}_t := \big\{\phi \mapsto \nabla_\phi H^{(1)}_t(\bar{z}_t; \phi)^\top v(\phi) : v \in H_{J,\nu}\big\}, \qquad \mathrm{Rad}_m(G^{\bar{z}_t}_t) := \mathbb{E}_{\substack{\epsilon \sim \mathrm{Rad}^m(1) \\ \Phi(0) \sim \zeta_{init}}}\Big[\sup_{g \in G^{\bar{z}_t}_t} \frac{1}{m} \sum_{i=1}^{m} \epsilon_i\, g(\Phi_i(0))\Big].$$

Note that $v \in H_{J,\nu}$ above can be replaced with $v \in \mathrm{Conv}(H_{J,\nu})$ without any loss. In that case, since the mapping $v^{(j)} \mapsto f_t(\bar{z}_t; v^{(j)}) \in G^{\bar{z}_t}_t$ is linear, $G^{\bar{z}_t}_t$ is replaced with $\mathrm{Conv}(G^{\bar{z}_t}_t)$ without changing the Rademacher complexity (Mohri et al., 2018).
The following provides a finer characterization of the approximation error.

Proposition 6.8. Under Assumption 6.7, if $\rho \geq \nu$, then

$$\epsilon_{app,n} \leq \frac{1}{1-\gamma}\, 2 \max_{0 \leq t} \cdots$$