# reinforcement_learning_for_infinitedimensional_systems__580c27a2.pdf

Journal of Machine Learning Research 26 (2025) 1-52 Submitted 9/24; Revised 5/25; Published 7/25

Reinforcement Learning for Inﬁnite-Dimensional Systems

Wei Zhang wei.zhang@wustl.edu Department of Electrical & Systems Engineering Washington University in St. Louis St. Louis, MO 63130, USA

Jr-Shin Li jsli@wustl.edu Department of Electrical & Systems Engineering Division of Computational & Data Sciences Division of Biology & Biomedical Sciences Washington University in St. Louis St. Louis, MO 63130, USA

Editor: Nan Jiang

Interest in reinforcement learning (RL) for large-scale systems, comprising extensive populations of intelligent agents interacting with heterogeneous environments, has surged signiﬁcantly across diverse scientiﬁc domains in recent years. However, the large-scale nature of these systems often leads to high computational costs or reduced performance for most state-of-the-art RL techniques. To address these challenges, we propose a novel RL architecture and derive eﬀective algorithms to learn optimal policies for arbitrarily large systems of agents. In our formulation, we model such systems as parameterized control systems deﬁned on an inﬁnite-dimensional function space. We then develop a moment kernel transform that maps the parameterized system and the value function into a reproducing kernel Hilbert space. This transformation generates a sequence of ﬁnite-dimensional moment representations for the RL problem, organized into a ﬁltrated structure. Leveraging this RL ﬁltration, we develop a hierarchical algorithm for learning optimal policies for the inﬁnite-dimensional parameterized system. To enhance the algorithm s eﬃciency, we incorporate early stopping at each hierarchy, demonstrating the fast convergence property of the algorithm through the construction of a convergent spectral sequence. The performance and eﬃciency of the proposed algorithm are validated using practical examples in engineering and quantum systems.

Keywords: Reinforcement learning, ensemble systems, moment kernelization, spectral sequence convergence, control theory

1. Introduction

Reinforcement learning (RL), a prominent machine learning paradigm, has gained recognition as a powerful tool for intelligent agents to learn optimal policies. These policies guide the agents decision-making processes toward achieving desired goals by maximizing the expected cumulative rewards. RL exhibits a wide range of applications across various domains, encompassing robotics, game-playing, recommendation systems, autonomous vehicles, and control systems. In recent years, learning optimal policies to manipulate the behavior of

c 2025 Wei Zhang and Jr-Shin Li.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1575.html.

Zhang and Li

large-scale systems of intelligent agents interacting with diﬀerent environments has attracted increasing attention and is gradually becoming a recurrent research theme in the RL community. Existing and potential applications of large-scale RL include targeted coordination of robot swarms for motion planning in robotics (Becker and Bretl, 2012; Shahrokhi et al., 2018), desynchronization of neuronal ensembles with abnormal oscillatory activity for the treatment of movement disorders such as Parkinson s disease, epilepsy, and essential tremor in neuroscience and brain medicine (Marks, 2005; Wilson, 2005; Shoeb, 2009; Zlotnik and Li, 2012; Li et al., 2013; Vu et al., 2024), and the robust excitation of spin ensembles for nuclear magnetic resonance (NMR) spectroscopy, imaging (MRI), and quantum computation in quantum science and technology (Glaser et al., 1998; Li et al., 2011; Dong et al., 2008; Chen et al., 2014). The fundamental challenge in these RL tasks undoubtedly lies in the massive scale of agent systems and the dynamic environments in which the agents take actions. For instance, a neuronal ensemble in the human brain may comprise up to 1011 neurons (Herculano-Houzel, 2012; Ching and Ritt, 2013), and a quantum ensemble in NMR experiments consists of spins on the order of Avogadro s number 1023 (Li, 2006; Cavanagh et al., 2010; Li, 2011). Although, mathematically speaking, these systems are composed of ﬁnitely many agents, it is more appropriate to treat them as inﬁnite populations. This is because their massive scale makes it infeasible to speciﬁcally identify and make decisions for each individual agent in these populations. These restrictions particularly imply that such policy learning tasks exceed the capabilities of classical multi-agent RL methods, which focus on learning joint policies that provide each agent with a customized action based on the states of all the other agents (Littman, 1994; Foerster et al., 2016; Gupta et al., 2017; Zhang et al., 2021b; Albrecht et al., 2024). Consequently, all the agents are required to take the same action, necessitating the learning and implementation of policies at the population level. In addition, natural models describing the dynamic behavior of these intelligent agents typically take the form of continuous-time deterministic control systems. Learning optimal policies for this category of systems also lies outside the scope of most state-of-the-art and benchmark RL algorithms, notably Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) (Schulman et al., 2017, 2015), which are based on Markov decision processes (MDPs) and operate as discrete-time stochastic control processes. This, in turn, underscores the urgent demand for a more inclusive RL framework that accommodates policy learning tasks for inﬁnite populations of agents in a continuous-time deterministic setting.

Our contributions. This work is devoted to developing a novel RL architecture that enables the derivation of eﬀective algorithms to learn optimal policies for population systems consisting of inﬁnitely many intelligent agents interacting with heterogeneous dynamic environments. We formulate such an agent system as a parameterized control system, where each individual system indexed by a speciﬁc parameter value represents the environment of an agent in the population. We then evolve this parameterized system on an inﬁnite-dimensional function space and carry out a functional setting for RL of this inﬁnite-dimensional dynamical system. The primary tool we develop to tackle this RL problem is the moment kernel transform. This transformation maps the parameterized system and the value function to a control system and a value function deﬁned on a reproducing kernel Hilbert space (RKHS) consisting of moment sequences, thus yielding a kernel parameterization of the

Reinforcement Learning for Infinite-Dimensional Systems

RL problem. The use of moment sequences for this kernelization directly enables ﬁnitedimensional truncation representations of the inﬁnite-dimensional RL problem. We then organize these representations into a ﬁltrated structure with respect to the truncation order. Leveraging this RL ﬁltration, we develop a hierarchical policy learning algorithm, wherein each hierarchy consists of an RL problem for a ﬁnite-dimensional truncated momentkernelized system. To enhance the computational eﬃciency of the proposed algorithm, we also develop early stopping criteria for the ﬁnite-dimensional RL problem in each hierarchy and prove convergence of the hierarchical algorithm with early-stopped hierarchies in terms of spectral sequences. The performance and eﬃciency of the proposed hierarchical policy learning algorithm are demonstrated using examples arising from practical applications. The contributions of our work are summarized as follows.

Formulation of an inﬁnite population of intelligent agents as a parameterized control system deﬁned on an inﬁnite-dimensional function space.

Development of the moment kernel transform that gives rise to a kernel parameterization of RL problems for parameterized systems in terms of moment sequences in an RKHS.

Design of a ﬁltrated RL algorithm for learning optimal policies of parameterized systems with convergence guarantees.

Exploration of early stopping criteria for each hierarchy in the proposed hierarchical algorithm and proof of the spectral sequence convergence of the hierarchical algorithm with early-stopped hierarchies.

Related works. Learning control policies to coordinate large populations of intelligent agents using RL approaches has attracted increasing attention in recent years. Signiﬁcant achievements have been recognized in the ﬁeld of control theory and engineering. Prominent examples include control of multi-agent systems (Buşoniu et al., 2010; Heredia and Mou, 2019; Jiang et al., 2021; Zhang et al., 2021b; Albrecht et al., 2024) and many-body quantum systems (Dong et al., 2008; Lamata, 2017; Bukov et al., 2018; Haug et al., 2021), as well as planning and regulation for multi-robot systems (Matarić, 1997; Long et al., 2018; Yang et al., 2020a; Yang and Gu, 2004; Lu et al., 2024). Furthermore, large-scale RL has also gained great success in various emerging applications that are generally considered to be outside the traditional control-theoretic scope, notably, strategic gaming (Silver et al., 2017; Sil, 2018; Vinyals et al., 2019; Open AI et al., 2019; Schrittwieser et al., 2020; Kaiser et al., 2020) and human-level decision making (Mnih et al., 2015; Liu et al., 2022; Baker et al., 2020). One of the most active research focuses regarding large-scale and high-dimensional policy learning problems is deep RL, which incorporates deep learning techniques, particularly the use of deep neural networks, into RL algorithms (Bertsekas and Tsitsiklis, 1996; François-Lavet et al., 2018; Bellemare et al., 2020; Nakamura-Zimmerer et al., 2021; Le et al., 2022; Sarang and Poullis, 2023). Despite the advantage of exceptional generalizability, training deep neural networks to tackle large-scale and/or high-dimensional learning problems is widely known to suﬀer from the notorious curse of dimensionality, leading to expensive computational costs and scalability issues. Various approaches have been proposed to mitigate the impact of these phenomena, primarily including the development of distributed and multi-agent RL

Zhang and Li

algorithms (Heredia and Mou, 2019; Heredia et al., 2020; Yazdanbakhsh et al., 2020; Heredia et al., 2022; Xie et al., 2024) and the search for compact representations of high-dimensional measurement data. Notable among these are the successor representation (Momennejad et al., 2017), latent space representation (Gelada et al., 2019), contrastive unsupervised representation (Laskin et al., 2020), and invariant representation (Zhang et al., 2021a). These works have achieved considerable success in learning optimal policies for agent populations on a scale ranging from tens to thousands. To further push the boundaries of RL towards addressing larger agent populations, tools in mean-ﬁeld theory - particularly mean-ﬁeld games and control - have drawn increasing attention and have been adopted in the RL setting in recent years. Instead of learning a policy for each individual agent in a population, mean-ﬁeld RL focuses on the mean-ﬁeld approximation of the population, where each collection of interacting agents is replaced by a single agent representing their averaged behavior (Yang et al., 2018; Subramanian and Mahajan, 2019; Laurière et al., 2022; Pásztor et al., 2023; Bensoussan et al., 2013; Carmona et al., 2019; Fu et al., 2020; Carmona et al., 2020). However, the prerequisite of the mean-ﬁeld approximation, arising from the fundamental principles of statistical physics, places the primary focus of mean-ﬁeld RL on populations of identical agents (Pathria and Beale, 2021). In this context, policy learning for arbitrarily large, and in the limit, inﬁnite populations of heterogeneous intelligent agents, as considered in this work, remains under-explored with sparse literature in the RL community. On the other hand, the formulation of RL problems over inﬁnite-dimensional spaces has only been proposed in the stochastic setting, aiming to learn feedback control policies for stochastic partial diﬀerential equation systems using variational optimization methods (Evans et al., 2020). The major technical tool developed in this work to overcome the curse of dimensionality is the moment kernel transform, which is inspired by the method of moments. This method was developed by the Russian mathematician P. L. Chebyshev in 1887 to prove the law of large numbers and the central limit theorem (Mackey, 1980). Since then, the method has been extensively studied under diﬀerent settings, notably the Hausdorﬀ, Hamburger, and Stieltjes moment problems (Hamburger, 1920, 1921a,b; Hausdorﬀ, 1923; Stieltjes, 1993). The most general formulation in modern terms was proposed by the Japanese mathematician K osaku Yosida (Yosida, 1980). Recently, the method of moments was introduced to control theory for establishing dual representations of ensemble systems (Narayanan et al., 2024; Li et al., 2025) and in machine learning-aided medical decision making as a feature engineering technique (Yu et al., 2023). These prior works lay the foundation for the development of the moment kernel transform in this work. In addition to establishing the ﬁltrated RL architecture, the moment kernel transform also gives rise to a reduced kernel representation of (inﬁnite) agent populations over an RKHS. It is widely known that RKHS theory forms the building blocks for kernel methods in machine learning, notably support vector machines (SVMs) and kernel principal component analysis (kernel PCA) (Hastie et al., 2009; Paulsen and Raghupathi, 2016). In the context of RL, elements of RKHSs are commonly used as function approximators for MDPs, through which the learning targets, including policies, value functions, and/or transition maps, are estimated in terms of linear combinations of reproducing kernels (Lever and Staﬀord, 2015; Yang and Wang, 2019; Yang et al., 2020b; Koppel et al., 2021). This kernel approximation is typically enabled by imposing the condition that these learning targets belong to an RKHS

Reinforcement Learning for Infinite-Dimensional Systems

consisting of functions deﬁned on the state space of the MDPs. In our work, we explore the use of RKHS-theoretic techniques within a functional setup rather than the traditional MDP setup, thus relaxing the condition on the learning targets. More importantly, along with enabling kernel approximation, the developed moment kernel transform also serves as a model reduction machine for arbitrarily large agent populations.

2. Policy Learning for Inﬁnite Dynamic Populations

This section is primarily dedicated to establishing a general formulation for RL of arbitrarily large populations of intelligent agents, which interact with heterogeneous environments in continuous time and are regulated by policies taking values in continuous spaces. We begin by demonstrating the challenges associated with such RL tasks, and then introduce our formulation that models these populations as continuous-time parametrized control systems deﬁned on inﬁnite-dimensional function spaces. We subsequently delve into imposing conditions that guarantee the existence of optimal policies for RL problems involving these inﬁnite-dimensional control systems over function spaces.

2.1 Challenges to reinforcement learning of large-scale population agents

Learning control policies for a large-scale population of intelligent agents, eﬀectively a continuum in the limit, described by continuous-time deterministic dynamical systems, presents a signiﬁcant challenge to RL. The primary obstacle is the enormous population size, which forces RL algorithms to operate in high-dimensional spaces. This inevitably leads to the curse of dimensionality, resulting in high computational costs and reduced learning accuracy (Bellman et al., 1957; Bellman, 1961; Sutton and Barto, 2018). Additionally, the continuous-time deterministic system formulation is inconsistent with the setting of most state-of-the-art RL algorithms, which are typically based on MDPs modeled by discrete-time stochastic control processes. To illustrate these challenges with a concrete example, we consider a population system consisting of N dynamic agents regulated by a common control policy, given by

d dtxi(t) = aixi(t) + u(t), i = 1, . . . , n, (1)

where xi(t) R is the state of agent i, u(t) R is the control policy, and ai R for all i. The entire size-n population can be represented as an n-dimensional system, d dtx(t) = Anx(t) + Bnu(t), where x(t) = [ x1(t), . . . , xn(t) ] Rn is the population state with denoting the transpose of vectors (and matrices), AN Rn n is the diagonal matrix with the (i, i)-entry given by ai for all i = 1, . . . , n, and BN Rn is a vector of ones. To put the analysis in a simpliﬁed setting, we choose 1 = a0 < a1 < < an = 1 to be the uniform partition of the interval [ 1, 1], i.e., ai = 1 + 2(i 1)/(n 1) for each i = 1, . . . , n. We aim to learn the inﬁnite-time horizon linear quadratic regulator (LQR) with the state-value function (cumulative reward or cost-to-go), V (x(t)) = R t e 2.5th 1 n Pn i=1 x2 i (t) + u2(t) i dt = R t e 2.5th 1 nx (t)x(t) + u2(t) i dt. It is well-known that the LQR value function V (x(t)) =

infu V (x(t)) is in the quadratic form V (x(t)) = x (t)Qx(t), parameterized by a positive deﬁnite matrix Q Rn n (Brockett, 2015). Therefore, the value iteration stands out as the

Zhang and Li

prime algorithm to tune the n(n + 1)/2 training parameters in Q to learn the LQR policy and value function (Sutton and Barto, 2018; Bertsekas, 2019).

Curse of dimensionality. In the simulation, we varied the system dimension n from 2 to 20. For each n, we independently ran the value iteration 5 times with random initial conditions. The simulation results are shown in Figure 1. In particular, Figure 1a shows the number of training parameters (top panel) and the average computational time over the 5 runs of the value iteration versus N (bottom panel). From these plots, we observe dramatic increases in both quantities. In addition to these common phenomena associated with the curse of dimensionality, we also observed convergence issues in this high-dimensional system policy learning problem.

Convergence issue. Note that by the deﬁnition of Riemann integrals (Rudin, 1976), for each t, the term 1

n Pn i=1 x2 i (t) in the state-value function V is essentially a Riemann sum of the real-value function 1

2x(t, ) over the interval [ 1, 1]. Therefore, by the Dominated Convergence theorem (Folland, 2013), V possesses the convergence property,

V (x(t)) = Z

t e 2.5th 1

i=1 x2 i (t) + u2(t) i dt Z

1 x2(t, β)dβ + u2(t) i dt (2)

as N , where x(t, β) satisﬁes the linear parameterized system on R, given by

d dtx(t, β) = βx(t, β) + u(t), β Ω= [ 1, 1]. (3)

To see that the value function also inherits this convergence property, we notice that the statevalue function is continuous, and hence upper semi-continuous, for any control policy u. This implies that the values function V (x(t)) = infu V (x(t)) is upper semi-continuous (Folland, 2013). Together with the non-negativity of V , we obtain the convergence of the state-value function, lim N infu V (x(t)) = infu lim N V (x(t)) = infu R t e 2.5t

2 h R 1 1 x2(t, β)dβ +

u2(t) i dt (see Theorem 3 below for the justiﬁcation that the right-hand side is well-deﬁned) (Folland, 2013). Let V n and u n denote the value function and LQR policy learned from the n-dimensional system. As shown in Figure 1b, neither V n V n 1 = supt 0 |V n (x (t)) V n 1(x (t))| nor u n u n 1 = supt 0 |u n(t) u n 1(t)| shows a trend of converging to 0, where x (t) denotes the optimal trajectory. This particularly fails to verify the convergence of the value function as illustrated in (2).

Reinforcement Learning for Infinite-Dimensional Systems

2 4 6 8 10 12 14 16 18 20 0

Number of parameters

2 4 6 8 10 12 14 16 18 20 0

Computing time

5 10 15 20 0

Projection error

Figure 1: Illustration of the challenges to RL for high-dimensional systems. The standard value iteration is applied to learn the LQR policy and value function for an n-dimensional time-invariant deterministic linear system, with N ranging from 2 to 20. (a) shows the number of training parameters (top) and average computing time (bottom) with respect to the system dimension n. (b) plots the average distance between successive terms in the sequences of the learned value functions (blue) and LQR policies (red) with respect to n.

To further elaborate on this convergence issue, we treat the parameterized system in (3) as a continuum ensemble of linear systems indexed by β [ 1, 1]. The ﬁnite ensemble in (1) can be seen as essentially a size-n sample of this inﬁnite ensemble. From this perspective, the convergence issue arises due to the ill-posed problem stemming from the sampling (discretizing) of the continuum ensemble. Speciﬁcally, the optimal policy for a size-n sample may be highly suboptimal for a size-m ensemble for m > n, so that um is signiﬁcantly diﬀerent from un. To carry out a quantitative analysis in the simplest setting, we consider the case of learning a policy that steers the entire inﬁnite ensemble to the origin 0 with minimal energy. Speciﬁcally, the cumulative reward function is given by V (t, x(t)) = R 1 t u2(t)dt. The minimum energy policy for the size-n sample is given by un(t) = B ne Ant W 1 n (0, 1)x0n so that Vn(t, x(t)) = x 0n W 1 n (t, 1)x0n, where Wn(t1, t2) = R t2 t1 e Ans Bn B ne A nsds for 0 t1 t2 1 is referred to as the controllability Gramian, and x0n is the initial condition of the size-n ensemble system (Brockett, 2015; Liberzon, 2012). It is not hard to see that the analytic properties of Vn and un primarily depend on those of Wn. Speciﬁcally, the (i, j)-entry of Wn(t1, t2) is given by e (ai+aj)t1 e (ai+aj)t2

Its trace, tr Wn(t1, t2) = Pn i=1 e 2ait1 e 2ait2

2ai , satisﬁes 1 e 2

ntr Wn(t1, t2) e2 1

Hence, all the eigenvalues of W(t1, t2) are bounded below and above by λmin = 1 e 2

2 and λmax = e2 1

2 , respectively. As a result, for any diﬀerent sample sizes m and n, there are initial conditions x0n and x0m such that supt |Vn(t, x(t)) Vm(t, x(t))| λ 1 min λ 1 max = 2 and supt |un(t) um(t)| 2, highlighting the convergence issue. To overcome the presented challenges, we propose a new kernel parameterization technique that will serve as a foundational element for developing a novel RL architecture. To facilitate this, we will ﬁrst introduce a principled RL formulation that accounts for large-scale population systems, regardless of their size.

Zhang and Li

2.2 Reinforcement learning for parameterized systems on function spaces

The parameterized representation in (3) of the linear agents population in the limiting case inspires the modeling of agent populations of any size as parameterized diﬀerential equation systems, also referred to as ensemble systems, of the form

d dtx(t, β) = F(t, β, x(t, β), u(t)), (4)

where β is the system parameter taking values in Ω Rd, x(t, β) M is the state of the system indexed by the particular value of β (the environmental state of agent β) in the population with M Rn being a diﬀerentiable manifold, u(t) Rr is the control policy, and F(t, β, , u(t)) is a (time-varying) vector ﬁeld on M for each β Ωand u(t) Rr, characterizing the environment of the agent β. Similarly, as motivated by the Riemann sum convergence illustrated in (2), we associate the parametrized ensemble system in (4) with a state-value function in the integral form, given by

V (t, xt) = Z

t r(s, x(s, β), u(s))ds + K(T, x(T, β)) i dβ, (5)

where xt( ) .= x(t, ), and r(s, x(s, β), u(s)) and K(T, x(T, β)) represent the running and terminal costs of agent β, respectively. One of the major advantages of the parameterized system formulation is the ﬂexibility to model agent populations of arbitrarily large size, extending to a continuum of agents when the parameter space Ωhas an uncountable cardinality, as in the case of (3). However, due to practical limitations on sensing capability and computing power, it is impossible to collect comprehensive measurement data documenting the state and reward information for all the agents in such a massive-scale agent population. This constraint particularly hinders the design and implementation of feedback control policies, placing the relevant policy learning problem beyond the scope of many existing RL and optimal control methods. To develop a new RL paradigm suitable for this type of policy learning tasks, we reframe the parameterized system in (4) and its state-value function in (5) from a diﬀerent perspective. Speciﬁcally, the ensemble state of the parameterized system can be seen as a function x(t, ) : Ω M, such that the system is evolving on a space F(Ω, M) of M-valued functions deﬁned on Ω. Thus, the state-value function is a functional on F(Ω, M), essentially characterizing the average of the cumulative rewards of all the agents over the entire population. This functional viewpoint, in turn, situates the policy learning tasks within the function space F(Ω, M). When Ωis an inﬁnite set, F(Ω, M) becomes an inﬁnite-dimensional manifold. Therefore, to ensure the solvability of these policy learning tasks, some regularity conditions on the system dynamics F, immediate reward r, and terminal cost K, more stringent than the canonical setup in RL, are required.

Assumption S1. (Boundedness of control policies) The ensmeble control policy u : [0, T] Rr is a measurable function and takes values in a compact subset U of Rr.

Assumption S2. (Lipschitz continuity of system dynamics) The vector ﬁeld F : R Ω M U TM is continuous in all variables and Lipschitz continuous in z M uniformly for

Reinforcement Learning for Infinite-Dimensional Systems

(t, β, a) [0, T] Ω U. Speciﬁcally, there exists a constant C (independent of t [0, T], β Ω, and a U) such that |(ϕ F|V )(t, β, ϕ(p), a) (ϕ F|V )(t, β, ϕ(q), a)| C|ϕ(p) ϕ(q)| for any p, q V and coordinate chart (V, ϕ) on M. Here, TM denotes the tangent bundle of M, ϕ F|V denotes the pushforword of F|V , the restriction of the vector ﬁeld F on V M, and | | denotes a norm on Rn.

According to the theory of ordinary diﬀerential equations, Assumption S2 guarantees that, driven by any admissible control policies, each individual system in the ensemble indexed by β in (4) has a unique and Lipschitz continuous solution t 7 x(t, β) (Arnold, 1978; Lang, 1999). Correspondingly, at the population level, where β varies over Ω, the ensemble system has a unique solution in F(Ω, M), given by t 7 x(t, β).

Assumption C1. (Integrability of the state-value function) There exists an ensemble control policy u U such that R

Ω R T 0 |r(t, x(t, β), u(t))|dtdβ + R

Ω|K(T, x(T, β))|dβ < , where x(t, ) F(Ω, M) is the solution of the ensemble system in (4) driven by u.

Assumption C2. (Lipschitz continuity of private costs) Both the running cost r : R M U R and terminal cost K : R M R are continuous functions in all variables and Lipschitz continuous in z M for any (t, a) [0, T] U.

In the sequel, these regularity assumptions will be exploited to prove the existence of a solution to the functional policy learning problem formulated in (4) and (5). In other words, the value function V (t, xt) = infu V (t, xt) is a well-deﬁned real-valued function on [0, T] F(Ω, M). To emphasize the dependence of the state-value function V on the policy u, we denote the total cost by V (0, x0) = J(u), which deﬁnes a function J : U R, referred to as the cost functional, with U = {u : [0, T] Rr | u(t) U} being the space of admissible policies. The solvability of the learning problem then hinges on proving the compactness of U and continuity of J.

Lemma 1 (Compactness of admissible policies) The space of admissible policies U is compact.

Proof Topologically, U, equipped with the topology of pointwise convergence, is the product space Q t [0,T] U under the product topology. Because U Rm is compact by Assumption S1, the compactness of U directly follows from Tychonoﬀ s theorem (Munkres, 2000).

Lemma 2 (Sequential continuity of the cost functional) The cost functional J : U R is sequentially continuous, that is, J(uk) J(u) for any admissible policy sequence (uk)k N such that uk u in U.

Proof See Appendix A.

Theorem 3 (Existence of optimal policies) Given a parameterized ensemble system deﬁned on the function space F(Ω, M) as in (4) satisfying Assumptions S1 and S2, and the

Zhang and Li

state-value function deﬁned in (5) satisfying Assumptions C1 and C2, the value function V : [0, T] F(Ω, M) R, given by V (t, xt) = infu V (t, xt), is well-deﬁned. This implies that an optimal policy exists.

Proof By the uniqueness of the solution to the parameterized ordinary diﬀerential equation in (4), guaranteed by Assumptions S1 and S2, if there is a policy u U that minimizes J, then it necessarily minimizes V (t, xt) for all t [0, T] as well. In addition, we have J(u ) = minu U J(u) = V (0, x0) by the deﬁnition of J, which is unique as the inﬁmum (greatest lower bound) of the set J(U) R (Rudin, 1976). As a result, V (0, x0), and hence V (t, xt) for each t [0, T], has a deﬁnite real value, indicating that V is a well-deﬁned (i.e., single-valued) real-valued function on [0, T] F(Ω, M). Therefore, it suﬃces to show the existence of u U.

Without loss of generality, we assume that the integrability condition in Assumption C1 is satisﬁed for all u U. We then show that the range J(U) of the cost functional J is a compact subspace of R, equivalently, any sequence in J(U) has a convergent subsequence, because of J(U) R (Munkres, 2000). To show this, we pick an arbitrary sequence (Jk)k N in J(U) and claim that any policy sequence (uk)k N satisfying Jk = J(uk) for all k N has an accumulation point u U. To see this, we deﬁne Tk = {un : n k}. Then, T k N T k = holds, where T n denotes the closure of Tn. Otherwise, {U\T k}k N forms an open cover of U, which has a ﬁnite subcover, say U = TN k=1(U\T k), due to the compactness of U shown in Lemma 1, leading to the contradiction = U\ TN k=1(U\T k) = TN k=1 T k = T N. Now, let u T k N T k and W be a neighborhood of u. Then W Tk = for every k, i.e., W contains some uk for any arbitrarily large k, and hence u is necessarily an accumulation point of the sequence (uk)k N. The sequential continuity of J, proved in Lemma 2, implies that J necessarily maps accumulation points of policy sequences to accumulation points in J(u). As a result, there is a subsequence of (Jk)k N that converges to J(u), showing the sequential compactness, and hence compactness, of J(U).

The compactness of J(U) particularly implies that J = inf J(U) = infu U J(u) J(U). Therefore, there exists u U such that J(u ) = J , meaning u is an optimal policy.

Remark 4 (Topological actor-critic algorithm) Although Theorem 3 veriﬁes the solvability of the policy learning problem over the inﬁnite-dimensional function space from a theoretical perspective, the main idea of the employed topological argument aligns closely with the actor-critic algorithm in RL. Speciﬁcally, the critic J constantly evaluates the actor u to iteratively improve its performance, generating a sequence of cost Jk converging to the minimal cost J . The corresponding policy u with J = J(u ) is then a desired optimal policy. Moreover, importantly, approaching u through a cost sequence instead of a policy sequence deliberately avoids a technical issue. Although U is compact, a policy sequence uk in U may not contain any convergent subsequence since U is not a ﬁrst-countable space (Munkres, 2000).

Reinforcement Learning for Infinite-Dimensional Systems

3. Reinforcement Learning for Parameterized Systems via Moment Parameterization

In this section, we will focus on developing an RL framework for learning optimal policies for parameterized ensemble systems deﬁned on an inﬁnite-dimensional function space. Our initial step, which is also essential to most learning problems, is to explore an appropriate parameterization for the learning targets. To this end, we will introduce a moment kernel transform, which generates kernel representations of parameterized systems and state-value functions over a reproducing kernel Hilbert space (RKHS).

3.1 Moment kernelization of parameterized systems and value functions

Our theoretical development is based on leveraging and extending the method of moments in functional analysis and probability theory (Mackey, 1980). The central idea is to represent the time-varying state functions of a parameterized system as time-dependent sequences of real numbers. To formalize this idea, we impose the following assumption:

Assumption K1. The state space F(Ω, M) of the parameterized ensemble system in (4), given by d

dtx(t, β) = F(t, β, x(t, β), u(t)) with β Ω Rd, is a Hilbert space H contained in L2(Ω, Rn), the space of Rn-valued square-integrable functions deﬁned on Ω.

Ensemble moments and moment kernel transform. To motivate our idea of dynamic moment kernelization, we consider the scalar-valued parameterized system deﬁned on the Hilbert space H0 L2(Ω, R); namely, the ensemble state x(t, ) .= xt( ) H0. Because L2(Ω, R) is separable, H0 is also separable as a linear subspace of H (Yosida, 1980). Hence, H0 possesses a countable orthonormal basis, denoted {Φk}k N. We deﬁne the kth moment of the parameterized system for each k N with respect to Φk as

mk(t) = Φk, xt , (6)

where , : H0 H0 R denotes the inner product in H0. Then, the moment sequence m(t) M0 associated with xt H0 is denoted by m(t) = (mk(t))k N, with M0 being the space of all moment sequences, referred to as the moment space. In addition, the inner product on H0, as a subspace of L2(Ω, R) illustrated in Assumption K1, is speciﬁcally given by mk(t) = Φk, xt = R

ΩΦk(β)xt(β)dβ. The choice of {Φk}k N as an orthonormal basis yields P k N |mk(t)|2 = m(t) M0 = xt H0 = R

Ω|xt(β)|2dβ < by Parseval s identity (Folland, 2013), where M0 and H0 denote the norms on M0 and H0, respectively. This implies that M0 is contained in the ℓ2-space, consisting of square-summable sequences. Essentially, mk(t) is the kth Fourier coeﬃcient of the function xt H0, and thus the moment sequence m(t) provides a coordinate representation of xt with respect to the basis {Φk}k N. In the general case where n > 1, the state space H of the parameterized system admits a decomposition as a direct sum of n copies of H0, i.e., H = H0 H0. Equivalently, each component xi t of the ensemble state xt = x1 t , . . . , xn t is an element of H0. As a result, the deﬁnition of moments in (6) can be extended to xt H in a component-wise manner as mk(t) = Φk, xt = Φk, x1 t , . . . , Φk, xn t . Then, mk(t) becomes an Rn-valued sequence so that the moment space M = M0 M0 is a Hilbert subspace of ℓ2 ℓ2, the n-fold direct sum of ℓ2-spaces. More importantly, this indicates that M is an RHKS as shown below.

Zhang and Li

Proposition 5 The moment space M is an RKHS on N, and the moment kernel transform K : H M, given by xt 7 m(t), is an isometric isomorphism between the Hilbert spaces H and M.

Proof Following the analysis above, we know that the moment space M0 of functions in H0 is contained in ℓ2. It remains to show that ℓ2 M0. To this end, we pick any m(t) ℓ2. According to the Pythagorean theorem, we have P k=0 mk(t)Φk = P k=0 |mk(t)|2 < , implying that the partial sum PN k=0 mk(t)Φk forms a Cauchy sequence. Therefore, xt = P k=0 mk(t)Φk is a well-deﬁned element (function) in H0 and satisﬁes Φk, xt = mk(t), yielding ℓ2 M0. This concludes that M0 = ℓ2. The restriction of the moment transform to each component of xt H is thus an isometric isomorphism from H0 to M0. Together with the fact that H = H0 H0, we obtain M = M0 M0 = ℓ2 ℓ2, and the moment kernel transform from H to M is an isometric isomorphism as well. Lastly, to show that M is an (Rn-valued) RKHS on N, it suﬃces to demonstrate that the point evaluation map Ek : M Rn, given by m(t) 7 mk(t), is bounded (Paulsen and Raghupathi, 2016). This follows from the estimate,

i=1 | Φk, xi t |2

i=1 Φk 2 H0 xi t 2 H0 =

i=1 xi t 2 H0 = xt 2 H = m(t) 2 M,

where | |, H, and M denote the norms on Rn, H, and M, respectively.

Moment kernelization of ensemble systems. Having kernelized the ensemble state of the parametrized system, the next step is to exploit the kernelized state and the moment transform to kernelize the system dynamics, i.e., the temporal evolution of the ensemble state over H. Intuitively, this requires taking the time-derivative of the moments, which yields a diﬀerential equation system governing the evolution of the moment sequence, given by d dtmk(t) = d

dt Φk, xt = D Φk, d

dtxt E = Φk, F(t, , xt, u(t)) . (7)

Here, the change of the order of the time-derivative and the inner product operation follows from the dominated convergence theorem (Folland, 2013). Note that because the statespace H of the parameterized system is a vector space, the vector ﬁeld F(t, β, xt(β), u(t)) governing the system dynamics, considered as a function in β, is an element of H as well. Following the deﬁnition in (6), we observe that Φk, F(t, , xt, u(t)) in (7) is essentially the kth moment of F(t, , xt, u(t)). Let F(t, m(t), u(t)) denote the moment sequence of F(t, , xt, u(t)) = F(t, , K 1m(t), u(t)). We obtain a concrete representation of the moment kernelized ensemble system deﬁned on M as d dtm(t) = F(t, m(t), u(t)). (8)

Remark 6 Note that the moment kernelized system in (8) consists of countably many components, even though the parameterized ensemble system in (4) may be composed of a continuum of (uncountably many) intelligent agents. This implies that the moment kernelization not only deﬁnes a kernel parameterization but also provides model reduction for parameterized systems. This feature will be fully exploited in the development of the ﬁltrated RL architecture for learning optimal policies of parameterized systems.

Reinforcement Learning for Infinite-Dimensional Systems

Moment kernelization of value functions. The ﬁnal step in exploring the moment kernel representation of the proposed policy learning problem is to kernelize the state-value function V : [0, T] H R. To achieve this, supported by the integrability condition in Assumption C1, we apply Fubini s theorem to change the order of the integrals in the ﬁrst summand of V , resulting in V (t, xt) = R T t R

Ωr(s, β, x(s, β), u(s))dβds+ R

ΩK(T, β, x(T, β))dβ (Folland, 2013). Observe that R

Ωr(t, β, x(t, β), u(t))dβ and R

ΩK(T, β, x(T, β))dβ are essentially the 0th moments of r(t, β, xt, u(t)) and K(T, β, x T ) as real-valued functions deﬁned on Ω, provided that Φ0 is a constant function. Denoting these by r(t, m(t), u(t)) and K(T, m(T)), respectively, we obtain the desired moment kernelized state-value function as

V (t, m(t)) = Z T

t r(s, m(s), u(s))ds + K(T, m(T)). (9)

Remark 7 The key step in kernelizing the state-value function above is changing the order of the integrals with respect to β Ωand t [0, T] using the integrability condition in Assumption C1. This derivation remains valid under a relatively weaker condition, namely, that the immediate reward function r is nonnegative for any admissible control policy, as a consequence of Tonelli s theorem (Folland, 2013). This nonnegative condition also occurs commonly in practice, e.g., the LQR problem presented in Section 2.1, demonstrating the general applicability of the proposed moment kernelization approach.

As the pointwise inﬁmum of the state-value function over the space of admissible policies, the value function naturally admits the moment kernel representation V : [0, T] M R, given by

V (t, m(t)) = inf u U V (t, m(t)) = inf u U

t r(s, m(s), u(s))ds + V (t + h, m(t + h)) o (10)

for any h > 0 such that t + h T, where the second equality follows from the dynamic programming principle (Evans, 2010). The integral representation of the value function in (10) leads to the following regularity property, which plays a crucial role in establishing RL approaches to policy learning of parameterized systems over the moment domain.

Proposition 8 The value function V : [0, T] M R is Lipschitz continous.

Proof See Appendix B.

3.2 Moment convergence for reinforcement learning of kernelized ensemble systems

As the moment kernelized system in (8) consists of countable state variables, this enables the use of truncated moment systems to facilitate RL of parameterized systems. This section is dedicated to laying the theoretical foundation for this approach. In particular, the main focus is to show that the value functions of truncated moment systems converge to those of the moment kernelized parameterized systems in an appropriate sense.

Zhang and Li

To rigorously formulate the corresponding policy learning problem for a truncated moment system, we use the hat notation ˆ to denote the truncation operation and identify the order-N truncated moment sequence with the projection of the inﬁnite moment sequence onto the ﬁrst N components, e.g., bm N(t) = (m0(t), . . . , m N(t), 0, 0, . . . ) and b FN = ( F0, . . . , FN, 0, 0, . . . , ) . This then constructs the order-N truncated moment system as d dt bm N(t) = b FN(t, bm N(t), u(t)), (11)

which is a control system deﬁned on c MN = {ˆz N M | z M}, the space of all order N truncated moment sequences. With a slight abuse of the hat notation, we denote the state-value and value functions of the system in (11) by b VN[0, T] c MN R and b V N : [0, T] c MN R, respectively. Clearly, the dynamic programming principle is also satisﬁed by this truncated value function as

b V N(t, bm N(t)) = inf u U b VN(t, bm N(t)) = inf u U

t r(s, bm N(s), u(s))ds + K(T, bm N(T)). (12)

The RKHS structure on the moment space M implies the convergence of the truncated moment sequence bm N(t) to the entire moment sequence m(t) as N . This can be observed by the vanishing of the truncation error as

m(t) bm N(t) 2 =

k=N m2 k(t) 0 (13)

as N , because mk 0 as k following from m(t) being square summable. We then have

F(t, m(t), u(t)) b FN(t, bm N(t), u(t)) F(t, m(t), u(t)) b FN(t, m(t), u(t))

+ b FN(t, m(t), u(t)) b FN(t, bm N(t), u(t)) 0.

Here, because F is also a moment sequence as discussed previously, the convergence of the ﬁrst term to 0 is essentially the same as (13). Combined with the continuity of b FN, this ensures the convergence of the second term to 0. To reinforce the convergence property from truncated moment sequences to truncated value functions, it is crucial to be aware of the fact that b V N is barely the restriction of b V on the subspace c MN M, which we denote by V | c MN. This is because the state

trajectory of the order-N truncated moment system in (11) must stay within c MN, while the optimal trajectory of the entire moment system (8), starting from an initial condition in c MN, may leave c MN. Therefore, it is necessary that b V N t, bm N(t) V | c MN t, bm N(t) , but

this minor discrepancy will not preclude the desired convergence of b V N to V as N . To make this convergence argument mathematically rigorous, we extend the domain of b V N from [0, T] c MN to [0, T] M by deﬁning b V N(t, m(t)) .= b VN(t, bm N(t)).

Theorem 9 (Moment convergence of value functions) The value function b V N

N N of the truncated moment system in (11) converges locally uniformly to V , the value function of the moment system in (8), on [0, T] M as N .

Reinforcement Learning for Infinite-Dimensional Systems

Proof Because the spaces of truncated moment sequences form an ascending chain of subspaces of the moment space, meaning c M0 c M1 M, the sequence of value functions b V N

N N forms a decreasing chain b V 0 t, z b V 1 t, z V t, z for any

z M. This particularly implies that b V N

N N is locally uniformly bounded, i.e., there

exists a neighborhood N of (t, z) in [0, T] M and a real number M so that b V N(s, w) M for all (s, w) N and N N with M independent of (s, w) and N. On the other hand, by the aforementioned fact that b V N V , together with the Lipschitz continuity of V shown in Lemma 8, all the functions in the sequence b V N

N N are Lipschitz continuous and have the same Lipschitz constant equal to that of V . Consequence, this sequence of value functions is equicontinuous Rudin (1976). A direct application of the Arzelà Ascoli theorem then shows that b V N V on N uniformly as N (Folland, 2013). Since (t, z) [0, T] M is arbitrary, we obtain b V N V locally uniformly on [0, T] M as N as desired.

According to the dynamic programming principle illustrated in (10), value functions are obtained by evaluating the corresponding state-value functions along optimal trajectories. It then becomes intuitive that, accompanied by the convergence of the truncated value function, the optimal trajectory of the truncated moment system also converges to that of the entire moment system.

Theorem 10 (Moment convergence of optimal trajectories) Let bm N(t) and m (t) be the optimal trajectories of the order-N truncated moment system and the entire moment system in (11) and (8), respectively. Then, bm N(t) m (t) on M as N for all t [0, T]. Equivalently, bx N(t, ) x (t, ) on H as N for all t [0, T], where bx N(t, ) and x (t, ) are the trajectories of the parameterized system in (4) driven by the optimal control policies for the truncated and entire moment systems, respectively.

Proof By the dynamic programming principle in (10), the optimal trajectory m (t) on [0, T] remains optimal when restricted to any subinterval [s, T] for 0 < s < T, and the same result holds for bm N(t) for each truncation order N N. Because the sequence of real numbers b VN(t, bm N(t))

N N is monotonically decreasing and bounded from below by V (t, m (t)) as shown in the proof of Theorem 9, it is necessary that b VN(t, bm N(t)) V (t, m (t)) as N for each t [0, N] (Rudin, 1976). Together with the continuity of V and the locally uniform convergence of the value function sequence b VN

N N by Theorem 9, we

obtain V (t, m (t)) = lim N b VN t, bm N(t) = V t, lim N bm N(t) so that bm N(t) m (t) on M as N . The isometrically isomorphic property of the moment kernel transform K : H M then implies that K 1 bm N(t) K 1m (t), i.e., bx N(t, ) x (t, ), on H as N .

A fundamental property of value functions crucial to RL is that they are viscosity solutions of Hamiltonian-Jacobi-Bellman (HJB) equations (Evans, 2010), which enables the design of RL algorithms to learn optimal control policies. The moment convergence shown in Theorems 9 and 10 then leads to an extension of HJB equations to value functions deﬁned on inﬁnite-dimensional spaces.

Zhang and Li

Corollary 11 (Moment convergence of Hamilton-Jacobi-Bellman equations) The value function V : [0, T] M R is the unique viscosity solution of the Hamilton Jacobi-Bellman equation, given by

t V (t, z) + min a U

n DV (t, z), F(t, z, a) + r(t, z, a) o = 0, (14)

on (0, T) M with the boundary condition V = K on {t = T} M, where DV (t, z) is the (Gateaux) diﬀerential of V with respect to z M.

Proof The uniqueness directly follows from the Lipschtiz continuity of the Hamiltonian H(t, z, p) = mina U p, F(t, z, a) + r(t, z, a) in (z, p) M M uniformly in t [0, T], as a consequence of Assumption C2, where M is the dual space of M (Evans, 2010). Then, it remains to examine whether V is a viscosity solution of the ﬁrst-order partial diﬀerential equation in (14). To show this, for any (t, z) [0, T] M, we pick a continuously diﬀerentiable function v : [0, T] M R such that V v has a local maximum at (t, z). Without loss of generality, we can assume (t, z) to be a strict local maximum; that is, there exists a neighborhood N whose closure N contains (t, z) such that V (t, z) v(t, z) > V (s, w) v(t, z) for any (s, w) N\{(t, z)}. Let (t N, z N) be a maximum of b V N v on N. Then, we have

tv(t N, z N) + min a U

n Dv(t N, z N), b FN(t N, z N, a) + r(t N, z N, a) o 0 (15)

following from the fact that b V N is the value function for he truncated moment system in (11) deﬁned on the ﬁnite-dimensional space c MN (Evans, 2010). By passing to a subsequence and shrinking the neighborhood N if necessary, we have (t N, z N) (et, ez) as N for some t [0, T] and ez N. This leads to the convergence b V N(t N, z N) v(t N, z N) V (et, ez) v(et, ez) by the locally uniform convergence of b V N to V

shown in Theorem 9. Because b V N(t N, z N) v(t N, z N) b V N(s, w) v(s, w) for any (s, w) N and N N by the choice of (t N, z N), we obtain V (et, ez) v(et, z) V (s, w) v(s, w) for any (s, w) N by letting N . In particular, V (et, ez) v(et, ez) V (t, z) v(t, z). This shows that (et, ez) = (t, z) and hence (t N, z N) (t, z) as N . Passing to the limit as N in (15) yields

tv(t, z) + min a U Dv(t, z), F(t, z, a) + r(t, z, a) 0, (16)

where we use the uniform convergence of b FN to F, following from a similar proof as Theorem 9 by replacing b V N and V with b FN and F, respectively, and the continuity of the Hamitonian H(t, z, p) = mina U p, F(t, z, a) + r(t, z, a) . A similar argument shows that

tw(t, z) + min a U Dw(t, z), F(t, z, a) + r(t, z, a) 0 (17)

if V w attains a local minimum at (t, z) [0, T] M. The equations in (16) and (17) imply that V is a viscosity solution of the Hamilton-Jacobi-Bellman equation in (14).

Reinforcement Learning for Infinite-Dimensional Systems

3.3 Moment Kernelization of Parameterized Systems in Stochastic Environments

The proposed parametrized model formulation and moment kernelization framework naturally extend to stochastic settings. We explore the adoption of our approach in three types of stochastic environments.

Model uncertainty. The parametrized ensemble system in (4) also provides a natural formulation for describing systems with model uncertainties. In this interpretation, the system parameter β characterizes uncertainty and can be regarded as a random variable drawn from a probability distribution P on the parameter space Ω. Consequently, the trajectory of the parameterized system xt is a stochastic process on the probability space (Ω, P), and the value function becomes

V (t, xt) = E Z T

t r(s, xs, u(s))ds + K(T, x T ) = Z

t r(s, xs, u(s))ds + K(T, x T ) d P,

where E denotes the expectation with respect to the probability measure P. Suppose that xt possesses ﬁnite variance for all t, then the kth moment of the parameterized system can be deﬁned as

mk(t) = Φk, xt = E Φkxt = Z

Ω Φkxtd P, (18)

where {Φk}k N is an orthonormal basis of L2(Ω, P). Leveraging this deﬁnition of moments, the kernelized system and value function follow the same forms as in (8) and (9), respectively. This implies that all the developments and conclusions in Sections 3.1 and 3.2 carry over to the stochastic setting with model uncertainties.

Background noise. Background noise is frequently inherent in the agents environments and typically appears as additive noise present in measurements of the states of agents. Here, we consider two types of background noise commonly encountered in practice: independent noise and common noise.

Independent noise: In this case, each agent in the parameterized family in (4) experiences parameter-dependent independent additive noise of the form xt(β) + ε(β), where {ε(β)}β Ωis a family of pairwise independent random variables. Suppose that ε(β) has zero mean and ﬁnite variance for all β Ω. Then, Fubini s theorem (Folland, 2013) implies that

Ω ε(β)dβ 2 = E Z

Ω ε(γ)dγ = E Z

Ω2 ε(β)ε(γ)dβdγ

Ω2 E ε(β)ε(γ) dβdγ = Z

D E ε2(β) dβdγ + Z

Ω2\D E ε(β) E ε(γ) dβdγ.

Here, D = {(β, γ) Ω2 : β = γ} is the diagonal subset of Ω2 = Ω Ω, and the last equality follows from the pairwise independence of {ε(β)}β Ω. We further observe that D has Lebesgue measure 0, and together with E(ε2(β)) < , this leads to R

D E ε2(β) dβdγ = 0. On the other hand, the zero mean property gives

Zhang and Li

Ω2\D E ε(β) E ε(γ) dβdγ = 0. As a result, we have E R

Ωε(β)dβ 2 = 0, indicating

Ωε(β)dβ = 0 almost surely (a.s.).

Following the same deﬁnition as in (6), the kth moment satsiﬁes

mk(t) = Φk, xt + ε = Z

Ω Φk(β)xt(β)dβ + Z

Ω Φk(β)ε(β)dβ = Φk, xt a.s.

for all k N. Here, we use the claim that R

ΩΦk(β)ε(β)dβ = 0 a.s. To see this, consider that because of the square-integrability of Φk, the set L = {β Ω: Φk(β) = } has Lebesgue measure 0. This gives R

ΩΦk(β)ε(β)dβ = R

L Φk(β)ε(β)dβ+ R

Ω\L Φk(β) 0dβ = 0 a.s., as claimed.

Common noise: In this case, each agent in the parameterized ensemble system in (4) experiences a common noise ε, so that its environmental state follows the form xt(β)+ε. Suppose that ε is a random variable with zero mean and ﬁnite variance, then the kth

moment is given by

mk(t) = Φk, xt + ε = Z

Ω Φk(β)xt(β)dβ + ε Z

is a random variable with the mean given by

E mk(t) = Z

Ω Φk(β)xt(β)dβ + Z

Ω Φk(β)dβ E(ε) = Z

Ω Φk(β)xt(β)dβ = Φk, xt .

Statistically, this implies that the moments of the parameterized system in the presence of common background noise are an unbiased estimator of the moment without noise. More importantly, the moment kernelization reduces the variance. By Hölder s inequality, we have

Var(mk(t)) = Z

Ω Φk(β)dβ 2 Var(ε) |Ω| Φk 2Var(ε) < Var(ε),

provided, without loss of generality, the Lebesgue measure of Ωsatisﬁes |Ω| < 1.

Stochastic dynamics. When the parameterized system in (4) is driven by a noise process, e.g., a Brownian motion Wt on a probability space (Rn, P), the evolution of its state obeys the stochastic diﬀerential equation

dxt(β) = F(t, β, xt(β), u(t))dt + G(t, β, xt(β))d Wt,

with the state-value function given by

V (t, xt) = E Z

t r(s, xs, u(s))ds + K(T, x T ) dβ ,

where F and G satisfy the Lipschitz and linear growth conditions in the system state variable, F(t, β, x, u) F(t, β, y, u) + G(t, β, x, u) G(t, β, y) K1 x y and F(t, β, x, u) 2 + G(t, β, x, u 2 K2(1 + x 2), respectively, uniformly in (t, β, u) [0, T] Ω U. These

Reinforcement Learning for Infinite-Dimensional Systems

conditions guarantee that, for each β Ω, the system trajectory is a ﬁnite variance stochastic process on the probability space (Rn, P) (Øksendal, 2003). Following the deﬁnition in (18), the moments of this parameterized stochastic system can be deﬁned as

mkl(t) = E ΦkΨl xt = E Z

Ω Φk(β)Ψl(xt(β))dβ

for all (k, l) N2, where (Ψl)l N is an orthonormal basis of L2(Rn, P). Note that, in this case, the moment sequence (mkl(t))k,l N is a deterministic double sequence, demonstrating again the ability of the moment kernel transform to mitigate the noise eﬀect. It should be noted that, within a stochastic environment, the open-loop requirement for the control policy may cause the loss of the Markov property in the state evolution of the parameterized system. This severely limits the applicability of MDP-based RL algorithms to the parameterized system, underscoring the necessity for the proposed moment kernelization technique to reduce randomness involved in the system. In the next section, we will develop a new RL architecture that fully exploits the algebraic structure of moment kernelized systems to facilitate policy learning for parameterized systems.

4. Filtrated Reinforcement Learning Architecture for Policy Learning of Parameterized Ensemble Systems

Building upon the theoretical foundations established in the previous sections, we now turn our attention to algorithmic approaches to policy learning in parameterized ensemble systems evolving on inﬁnite-dimensional function spaces. In this section, we will develop a novel RL architecture with eﬀective algorithms by organizing truncated moment kernelized systems with increasing truncation orders into a ﬁltrated structure. We will adopt spectral sequence techniques to conduct a convergence analysis of the proposed ﬁltrated RL (FRL) algorithms. Additionally, we will quantitatively investigate the computational and sample eﬃciency of FRL. The performance and eﬃciency of the FRL algorithms will be demonstrated using examples arising from practical applications and compared with baseline deep RL models.

4.1 Filtrated policy search for moment kernelized systems

After transforming the parameterized ensemble system to the moment domain, it is evident that an order-N truncated moment system contains any truncated moment system with a lower truncation order N < N as a subsystem. From a learning perspective, the corresponding RL problem for the order-N truncated moment system is a subproblem of that for the order-N truncated moment system. Generally, along any increasing sequence of truncation orders N0 < N1 < , a ﬁltration consisting of an ascending chain of RL problems is revealed. To further elaborate on this FRL architecture, starting from an initial moment truncation order N0, an optimal policy u N0(t) and the value function b V N0(t, bm N0(t)) can be learned for the order-N0 truncated moment system. Next, the truncation order is increased to N1 to learn u N1(t) and b V N1 t, bm N1(t) for the order-N1 truncated moment system. Continuing this procedure, we generate a sequence of control policy u Nk(t)

value function b V (t, bm Nk(t))

k N pairs. Guaranteed by the moment convergence proved in Theorem 10, the value function sequence b V (t, bm Nk(t))

k N necessarily converges to the

Zhang and Li

Order N0 truncation RL

Order N1 truncation RL

Order Nk truncation

Hierarchical

Parameterized

Moment kernelization

Figure 2: Workﬂow of FRL for parameterized ensemble systems deﬁned on inﬁnite-dimensional function spaces.

value function V (t, m(t)) of the entire moment system. Algorithmically, when the projection error supt [0,T] b V Nk t, bm Nk(t) b V Nk 1 t, bm Nk 1(t) < ε is satisﬁed for a prescribed tolerance ε > 0, the optimal policy u Nk(t) of the order-Nk truncated moment system provides a suﬃciently good approximation to the optimal policy u (t) of the entire moment system, equivalently the parameterized ensemble system. The workﬂow of this FRL approach is shown in Figure 2.

Remark 12 (Optimality preserving ﬁltration) The observation made above Theorem 9, namely, b V N V |c MN for any moment truncation order N N, directly extends to the

established RL ﬁltration; speciﬁcally b V Ni b V Nj|c MNi b V Nj for any i < j. This means that the optimal policy learned at one hierarchy of the ﬁltrated RL problem is also optimal for, and may even decrease the values of, the cumulative reward function learned from all lower-level hierarchies. This demonstrates that moving up the hierarchy in the RL ﬁltration preserves the optimality of each preceding hierarchy.

A direct observation of the truncated and entire moment systems in (11) and (8) reveals that they are regulated by the same control policy. This highlights policy search algorithms as prime candidates for learning the optimal policy at each hierarchy in the RL ﬁltration. Moreover, the optimality-preserving property revealed in Remark 12 strongly suggests a speciﬁc algorithmic approach: the policy learned from the current hierarchy is always guaranteed to be a good initial condition for the successive hierarchy. Consequently, as the truncation order increases, the initial condition becomes closer to the optimal solution, eﬀectively reducing the computational cost for leaning optimal policies of high-dimensional systems. This ﬁltrated policy search algorithm is summarized in Algorithm 1.

Reinforcement Learning for Infinite-Dimensional Systems

Algorithm 1 Filtrated policy search for learning optimal policies for parameterized systems Input: Initial state x0, ﬁnal time T, projection error tolerance ε Output: Optimal policy u

Initialization: truncation order N0, initial policy u(0), projection error P > ε, hierarchy level i = 0

2: while P > ε do

if i > 0 then

4: i i + 1 Pick Ni > Ni 1 and u(0) u Ni 1 6: end if Generate data by solving the parameterized system with the input u(0) and collect the rewards

8: Compute order Ni truncated moment kernelization: bm Ni(t), b FNi(t, bm Ni(t), u(0)), r(t, bm Ni(t), u(0)), K(T, bm Ni(T))

Solve u Ni(t) = argminu b VNi(t, bm Ni(t)) and compute b V Ni(t, bm Ni(t)) = minu b V Ni(t, bm Ni(t))

10: Compute P = supt [0,T] b V Ni t, bm Ni(t) b V Ni 1 t, bm Ni 1(t) with b V N 1 0 end while

12: u u Ni return u

4.1.1 Spectral sequence convergence of filtrated policy search

Let bm(j) Ni(t) denote the state trajectory of the order-Ni truncated moment system driven

by the policy u(j) Ni resulting from the jth iteration of the policy search (PS) algorithm. The ﬁltrated policy search shown in Algorithm 1 then generates a spectral sequence, given by

PS iteration 0 PS iteration 1 PS iteration j Hierarchy 0 b VN0(t, bm(0) N0(t)) b VN0(t, bm(1) N0(t)) b VN0(t, bm(j) N0(t)) b V N0(t, bm N0(t)) Hierarchy 1 b VN1(t, bm(0) N1(t)) b VN1(t, bm(1) N1(t)) b VN1(t, bm(j) N1(t)) b V N1(t, bm N1(t)) ... ... ... ... ... Hierarchy i b VNi(t, bm(0) Ni (t)) b VNi(t, bm(1) Ni (t)) b VNi(t, bm(j) Ni(t)) b V Ni(t, bm Ni(t)) ... V (t, m(0)(t)) V (t, m(1)(t)) V (t, m(j)(t)) V (t, m(t))

In this sequence, the ith row converges to the value function b V Ni(t, bm Ni(t)) of order-Ni truncated moment system, and the jth column converges to the state-value function V (t, m(j)(t)) of the entire moment system. Speciﬁcally, the row convergence naturally follows from the convergence of the PS algorithm (Sutton and Barto, 2018), while the column convergence results from the continuity of the state-value function and the convergence of the truncated moment sequence to the inﬁnite moment sequence. The row and column convergence together imply a stronger convergence property of Algorithm 1, that is, spectral sequence convergence, as deﬁned and proved below.

Zhang and Li

Theorem 13 (Spectral sequence convergence of FRL) Let b VNi(t, bm(j) Ni)

i,j N be the spectral sequence generated by Algorithm 1. Then, this sequence converges to the value function V (t, m(t)) of the moment kernelized system, i.e., b Vi(t, bm(φ(i)) Ni ) V (t, m(t)) as i for any monotonically increasing function φ : N N.

Proof It suﬃces to show that both iterated limits, limi limi b VNi(t, bm(j) Ni) and limj limi b VNi(t, bm(j) Ni), exist and are equal to V (t, m(t)).

We ﬁrst compute limi limi b VNi(t, bm(j) Ni). For each j N, following a similar proof

as for Theorem 10 by replacing bm N(t) and m (t) with bm(j) Ni(t) and m(j)(t), respectively, we

obtain b VNi(t, bm(j) Ni) V (t, m(j)(t)). Next, the convergence of V (t, m(j)(t)) to V (t, m(t)) is essentially the convergence of the policy search algorithm (Sutton and Barto, 2018), concluding that limi limi b VNi(t, bm(j) Ni) = V (t, m(t)).

On the other hand, to compute limj limi b VNi(t, bm(j) Ni), we ﬁrst note that for each i, the convergence of b VNi(t, bm)Ni(t)(j)) to b V Ni(t, bm Ni(t)) as j again follows from the convergence of the policy search algorithm. Since each V Ni(t, bm Ni(t)), as a value function, is a viscosity solution of the HJB equation associated with the order-Ni truncated moment system, Corollary 11 implies that V Ni(t, bm Ni(t)) V (t, m(t)) as i . This shows that

limj limi b VNi(t, bm(j) Ni) = V (t, m(t)), concluding the proof.

Spectral sequence-enabled early stopping for FRL. Interpreted using the spectral sequence above, Algorithm 1 approaches the optimal policy u of the parameterized ensemble system along the rightmost column, meaning the learning sequence consists of the optimal policies of all the truncated moment systems. The spectral sequence convergence of FRL shown in Theorem 13 guarantees that learning sequences along any paths towards the bottom right element V all converge to u . Algorithmically, such a learning sequence is obtained by employing early stopping in the PS algorithm at each hierarchy. For example, in the extreme case, the learning sequence along the diagonal of the spectral sequence is generated by executing only one iteration of the PS algorithm at each hierarchy. However, this naive diagonal learning sequence is generally not eﬀective in terms of computational eﬃciency and learning accuracy. This motivates the exploration of a sophisticated early stopping criterion to fully exploit the advantage of the distinguished spectral sequence convergence of FRL. To collect some thoughts about the design of early stopping criteria, we devote our attention to policy gradient (PG) methods, which are commonly used policy search approaches in RL literature. In the context of the spectral sequence, a PG algorithm applied at the ith

hierarchy generates a policy sequence in the form of u(j) Ni = u(j 1) Ni + δu(j 1) Ni . The update rule

δu(j 1) Ni is generally chosen to be proportional to D b JNi(u(j 1) Ni ), the gradient of the total reward

function of the order-Ni truncated moment system evaluated at u(j 1) Ni , to ensure u(j) Ni u Ni. In practice, to keep the agents behavior under control, e.g., staying in the safety region, the PG algorithm is generally implemented in a clipped manner to bound the amplitude of δu(j) Ni, such as Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2017, 2015). Inspired by this, we choose the the early stopping criterion, for

Reinforcement Learning for Infinite-Dimensional Systems

moving to the successive hierarchy of FRL, to be a threshold (δ > 0) for the variation of the state-value function, such that supt [0,T] b VNk(t, bm(i) k (t)) b VNk(t, bm(i) k (t)) > δ. It is worth mentioning that the use of the state-value function threshold, instead of a policy threshold as in TRPO and PPO, takes into consideration the possible failure of convergence in the generated policy sequence, as pointed out at the end of Section 2. Before integrating the hierarchy-wise early stopping criterion into the FRL algorithm, we ﬁrst carry out a detailed investigation into the computational and sample eﬃciency of FRL.

Convergence rate. It is intuitive that the rate of convergence of the spectral sequence depends on both the row and column convergence rates. As described in the proof of Theorem 13, each row is generated by a standard policy search algorithm, and hence the rate of row convergence is entirely determined by the applied algorithm. Meanwhile, the column convergence is guaranteed by the moment convergence of the value function shown in Theorem 9, which is a consequence of the convergence of the truncated moment sequence to the entire inﬁnite moment sequence. Hence, we ﬁrst evaluate the convergence rate of the truncate moment sequence. To this end, we note that bm N(t) m(t) 2 = P k=N+1 |mk|2

is essentially the tail of the moment sequence. Therefore, this convergence rate coincides with the rate of convergence of mk(t) to 0. Owing to the ℓ2-convergence of the moment sequence P k=0 |mk(t)|2 < , by the comparison test (Rudin, 1976), it is necessary that |mk(t)|2 < k 1 for large enough k since the harmonic series P k=1 k 1 fails to converge. As a result, the convergence rate of bm N(t) to m(t) is bounded above by O(N 1/2). To leverage this to compute the convergence rate of the column convergence, we have V (t, m(j)(t)) b VNi(t, bmj(t))

0 r(s, m(j)(s), u(s))ds + K(T, m(j)(s)) Z T

0 r(s, bm(j) Ni(s), u Ni(s))ds + K(T, m(j) Ni(s))

r(s, m(j)(s), u(s)) r(s, bm(j) Ni(s), u Ni(s)) ds + K(T, m(j)(s)) K(T, m(j) Ni(s))

0 Lr m(j)(s) m(j) Ni(s) ds + LK m(j)(T) m(j) Ni(T)

Lr TO(N 1/2 i ) + LKO(N 1/2 i ) O(N 1/2 i ),

where we used the Lipchitz continuity of the running cost r and terminal cost K, as presented in Assumption C2, with Lr and LK denoting their Lipchitz constants, respectively. We can now integrate the rates of the column and row convergence. Let O(α(N)) be the convergence rate of the policy search algorithm applied to the order-N truncated moment system, then we have V (t, m(t)) VN(t, bm( 1/α(N) ) N (t)) V (t, m(t)) V N(t, bm N(t))

+ V N(t, bm N(t)) VN(t, bm( 1/α(N) ) N (t) O(N 1/2) + O(α(N))

by the triangle inequality, where 1/α(N) denotes the smallest integer greater than or equal to 1/α(N). This concludes that, in the worst-case scenario, the convergence rate of the spectral sequence is bounded above by O(N 1/2) + O(α(N)). To further elaborate on it, if the stopping criterion is set to be ε, then we should choose N satisfying N 1/2 + α(N) < ε

Zhang and Li

as the moment truncation order and run 1/α(N) iterations of the policy search algorithm on the truncated moment system.

Computational and sample eﬃciency. As it is impractical to collect comprehensive measurement data from a parameterized ensemble system, practical implementations of the proposed FRL method require estimations of moments and moment systems using ﬁnite data points, even in cases where the mathematical model of the parameterized system is known. Suppose that only the states of q agents in the parameterized ensemble system in (4), say x(t, β1), . . . , x(t, βq), can be measured. We then deﬁne the sample moments by

mk(t) = |Ω|

i=1 Φk(βi)x(t, βi) (19)

for all k N, where |Ω| is the Lebesgue measure (volume) of Ω.

Proposition 14 Suppose that, at each time t, measurement data for q randomly selected systems in the parameterized ensemble in (4) are available. Then, the sample moments satisfy mk(t) mk(t) for all k N as q . Moreover, the convergence rate is bounded above by O(q 1), provided that each Φk is (essentially) bounded.

Proof Because the measured systems are randomly selected, βi, i = 1, . . . , q can be considered as a sequence of independent random variables, each of which has the uniform distribution on Ω. This indicates that Φk(βi)x(t, βi), i = 1, . . . , q, are independent and identically distributed random variables on Rn. The expectation of each Φk(βi)x(t, βi), which is E(Φkxt) = 1 |Ω| R Φk(β)xt(β)dβ, coincides with 1 |Ω|mk(t) and satisﬁes E Φkxt =

Ω|Φk(β)xt(β)|dβ 1 |Ω| Φk H xt H = 1 |Ω| xt H < . The strong law of large numbers then implies mk(t) mk(t) almost surely as q (Billingsley, 1995). To derive the bound of the convergence rate, it is necessary to compute the tail probability P |mk(t) mk(t)| > ε . We claim that when Φk is essentially bounded by some constant C, then the random variable Φkxt has a ﬁnite variance. This follows from E|Φkxt|2 =

ΩΦ2 k(β)x2 t (β)dβ C2

Ωx2 t (β)dβ = C2 xt 2 H |Ω| < . By Chebyshev s inequality, we have

P |mk(t) mk(t)| > ε E|mk(t) mk(t)|2

Φk(βi)xt(βi) mk(t)

Φk(βi)xt(βi) E(Φkxt) 2 = |Ω|2

i=1 E Φk(βi)xt(βi) E(Φkxt) 2

i=1 Var(Φkxt) C2|Ω| xt 2 H qε2 O 1

where the second and third equalities follow from the independence of βi, giving the desired bound.

Because the bound on the convergence rate is independent of the order of moments, the entire sample moment sequence m(t) = (mk(t))k N converges to the moment sequence

Reinforcement Learning for Infinite-Dimensional Systems

m(t) with the rate O(q 1) as well. In practice, we compute the sample moments up to a ﬁnite order N, constituting an order-N truncated sample moment sequence bm N(t). This computation only requires basic linear algebra operations with low computational complexity. To illustrate this, we concatenate the states of the q measured systems into a vector b Xq(t) = [x (t, β1), . . . , x (t, βq)] Rqn and deﬁne the moment kernel matrix bΨN R(N+1) M, whose (i, j)-entry is given by Φi(βj). By the deﬁnition of sample moments in (19), the order-N truncated sample moment sequence, represented as a vector bm N(t) = [m 0(t), . . . , m N(t)] R(N+1)n, is given by

bm N(t) = bΨN In b Xq(t), (20)

where In denotes the n-by-n identity matrix and is the Kronecker product of matrices. In terms of computations, bΨN In essentially arranges the (N + 1)q pre-determined numbers Ψi(βj) into an (N + 1)n-by-qn (block) matrix with (N + 1)q n-by-n diagonal blocks, which is of O(1) complexity. The matrix multiplication in (20), in the worst case, requires (N + 1)qn scalar multiplications and (N + 1)(q 1)n additions, resulting in a complexity of O(Nqn). In total, the complexity of computing the order-N truncated moment sequence using (20) is O(NMn) in the worst-case scenario.

In addition, because the vector ﬁeld F of the moment kernelized system in (8) is the moment sequence of F from the parameterized ensemble system in (4), as discussed in Section 3.1, the above computational and sample eﬃciency analysis is directly applicable to learning the moment kernelized system.

4.1.2 Second-order policy search

To showcase the FRL algorithm with early stopped hierarchies, in this section, we propose a second-order PS algorithm and integrate it into the FRL structure with an early stopping criterion. The development of the PS algorithm is based on the theory of diﬀerential dynamic programming (Jacobson and Mayne, 1970; Mayne, 1966). As a major advantage, the algorithm does not involve time discretization and is directly applicable to inﬁnitedimensional continuous-time systems, particularly the moment kernelized ensemble system in (8). The main idea is to expand the value function into a Taylor series up to the second-order term and then derive an update rule represented in the form of diﬀerential equations. In the following, we will highlight the key steps in the development of the algorithm (see Appendix C for the detailed derivation).

Quadratic approximation of Hamilton-Jacobi-Bellman equations. Let , : M M R be the inner product on the RKHS M. For any variation δm M at m M, applying Taylor s theorem to the (M-component of) value function V : [0, T] M R yields V (t, m+δm) = V (t, m)+ DV (t, m), δm + 1

2 D2V (t, m) δm, δm +o(δm2), where the Hessian D2V (t, m) is identiﬁed with a bounded linear operator from M M, and D2V (t, m) δm denotes the evaluation of D2V (t, m) at δm (Lang, 1999). The quadratic approximation of the HJB equation in (14) is subsequently obtained by replacing V with

Zhang and Li

its Taylor expansion while neglecting the high-order term o(δm2), that is,

t V (t, m) + D

t DV (t, m), δm E + 1

t D2V (t, m) δm, δm E

n H(t, m + δm, u, DV (t, m)) + D D2V (t, m) δm, F(t, m + δm, u) Eo = 0, (21)

where H(t, m + δm, u, DV (t, m)) = r(t, m, u) + DV (t, m), F(t, m, u) is the Hamiltonian of the moment kernelized system.

Second-order policy search. The next step is to design an iterative algorithm to solve (21). The main idea is to generate a policy learning sequence u(k) using the update rule u(k+1) = u(k) + δu(k) such that m = limk m(k), where m(k) and m are the trajectories of the moment system steered by u(k) and u , respectively. Then, the (m , u ) solves (21). To ﬁnd δu(k), we note that because the pair (m(k), u(k)) satisﬁes the moment system in (8), δm(k) = 0 holds. Thus, the equation in (21) reduces to

t V (t, m(k)) + minu H(t, m(k), u, DV (t, m(k))) = 0, where the solution u(k) = argminu H(t, m(k), u, DV (t, m(k))) is expressed in terms of m(k) and DV (t, m(k)). Steered by this policy u(k), the trajectory of the moment system is no longer m(k). Hence, a variation δm(k) on m(k) is produced such that m(k) + δm(k) satisﬁes the moment system. This updates the equation in (21) to

t V (t, m(k)) + D

t DV (t, m(k)), δm(k)E + 1

t D2V (t, m(k)) δm(k), δm(k)E

n H(t, m(k) + δm(k), u(k) + δu, DV (t, m(k)))

+ D D2V (t, m(k)) δm(k), F(t, m(k) + δm(k), u(k) + δu) Eo = 0. (22)

To solve the minimization problem in (22), we expand the objective function into a Taylor series with respect to (δm(k), δu) up to the second-order term, and then compute the critical point δu(k) in terms of DV (t, m(k)) and D2V (t, m(k)). Consequently, it remains to ﬁnd DV (t, m(k)) and D2V (t, m(k)). Following the Taylor expansion, the equation in (22) becomes an algebraic second-order polynomial equation in δm(k), which is satisﬁed if all the coeﬃcients are 0. This yields a systems of three ordinary diﬀerential equations, given by

d dtδV (t, m(k)(t)) = H(t, m(k), u(k), DV (t, m(k)) H(t, m(k), u(k), DV (t, m(k))), (23)

d dt DV (t, m(k)(t)) = DH D2V ( F F(t, m(k), u(k))), (24)

d dt D2V (t, m(k)(t)) = D2H D F D2V D2V D F

with the terminal conditions DV (T, m(k)(T)) = D K(T, m(k)(T)) and D2V (T, m(k)(T)) = D K(T, m(k)(T)), where δV (t, m(k)(t)) = V (t, m(k)(t)) V (t, m(k)(t)) and V (t, m(k)(t)) is the state-value function of the moment system driven by the policy u(k). Moreover, to simplify the notations, all the functions in the equations in (24) and (25) without arguments are evaluated at (t, m(k), u(k)). The notation denotes the transpose of linear operators.

Reinforcement Learning for Infinite-Dimensional Systems

Early stopping criterion. Because the development of the second-order PS algorithm is based on the Taylor series approximation, it is required that the amplitudes of δm(k)

and δu(k) are small enough for all k. A necessary condition to guarantee this is to bound δV (k) = supt |δV (t, m(k)(t))| by a threshold η. When integrating the proposed second-order PS algorithm into each hierarchy of the FRL structure, say the ith hierarchy, the algorithm will be terminated if supt |b VNi(t, bm(k) Ni (t)) b VNi(t, bm(k 1) Ni (t))| > η. The resulting policy u(k)

will then be used as the initial condition for the (i + 1)th hierarchy. The FRL algorithm with the early stopped second-order policy search is shown in Algorithm 2.

Algorithm 2 Filtrated reinforcement learning for parameterized systems with early stopped second-order policy search hierarchies Input: Initial state x0, ﬁnal time T, projection error tolerance ε, value function variation

tolerance η, maximum number of policy search iterations K

2: Initialization: truncation order N0, control policy u(0), projection error P > ε, value function variation δV = 0, hierarchy level i = 0, policy search iteration number j = 0 while P > ε do

4: while δV η and j K do

Generate data by solving the parameterized system with the input u(j) and collect the rewards

6: Compute order Ni truncated moment kernelization: bm Ni(t), b FNi(t, bm Ni(t), u(0)), r(t, bm Ni(t), u(0)), K(T, bm Ni(T))

Compute δ b VNi(t, bm(j) Ni(t)) D b V Ni(t, bm(j) Ni(t)) and D2 b V Ni(t, bm(j) Ni(t)) for 0 t T by solving the system of diﬀerential equations in (23), (24) and (25)

8: u(j+1)(t) argmina H(t, bm(j) Ni, a, D b VNi(t, bm(j) Ni)) for all 0 t T

δV maxt |δ b V (t, bm(j) Ni)|

10: j j + 1 end while

12: bm Ni bm(j 1) Ni if i 1 then

14: P = maxt [0,T] b VNi(t, bm Ni(t)) b VNi 1(t, bm Ni 1(t)) end if

16: Ni+1 Ni + 1, u(0) u(j 1)

18: end while

u u(j 1) return u

4.2 Examples and simulations

In this section, we will demonstrate the performance and eﬃciency of the proposed FRL algorithm using parameterized ensemble systems arising from practical applications. All the simulations were run on an Apple M1 chip with 16 GB memory.

Zhang and Li

4.2.1 Infinite-dimensional linear quadratic regulators

Linear-quadratic (LQ) problems, those are, linear systems with state-value functions being quadratic in both the system states and control variables, are among the most fundamental control problems, which have been extensively studied for ﬁnite-dimensional linear systems (Brockett, 2015; Kwakernaak and Sivan, 1972; Sontag, 1998). However, LQ problems for parameterized ensemble systems deﬁned on inﬁnite-dimensional function spaces remain largely unexplored. In this example, we will ﬁll in this literature gap to approach such type of inﬁnite-dimensional LQ problems using the proposed FRL algorithm.

To illuminate the main idea as well as to demonstrate how FRL addresses the curse of dimensionality, we revisit the parameterized scalar linear system in (3), i.e. d dtx(t, β) = βx(t, β) + u(t), β [ 1, 1], with a ﬁnite-time horizon state-vale function given by V (t, xt) = R 1 1 h R T t x2(t, β) + u2(t) dtdβ + x2(T, β) i dβ. To apply the moment kernel transform, we

consider the case where the system evolves on the Hilbert space L2([0, 1], R) and choose the basis {Φk}k N to be the set of Chebyshev polynomials. In this case, the moment kernelized system and state-value function can be analytically derived as d dtm(t) = Am(t) + bu(t) and V (t, m(t)) = R T 0 m(t) 2 + 2u2(t) dt + m(T) 2, where A = 1

2(L + R) with L and R the leftand right-shift operators, given by (m0(t), m1(t), . . . ) 7 (m1(t), m2(t), . . . ) and (m0(t), m1(t), . . . ) 7 (0, m0(t), . . . ), respectively (see Appendix D.1 for the detailed derivation). However, the analytic form of the moment kernelized system and state-value function are not required in the implementation of FRL.

In the simulation, the ﬁnal time and initial condition for the parameterized system were set to T = 1 and x0 = 1, the constant function on [ 1, 1], respectively, and the tolerance for the value function variation at each hierarchy was chosen to be η = 1. The truncation order N was varied from 2 to 10, and in each case the evolution of the truncated moment kernelized system was approximated using the sample moments computed from the measurement data for 500 systems with their system parameters β uniformly sampled from [ 1, 1]. The simulation results generated by Algorithm 2 are shown in Figure 3. Speciﬁcally, Figure 3a shows the total reward b V (0, bm N(0)) (top panel) and the number of the policy search iterations (bottom panel) with respect to the truncation order (hierarchy level) N. We observe that the total reward converges to the minimum value after only 4 hierarchies of the algorithm, which demonstrates the high eﬃciency of FRL. Correspondingly, Figure 3b plots the policy u N learned from each hierarchy of FRL, which stabilizes in the shadowed region starting from N = 6. It is also worth mentioning that the computational time for running 10 hierarchies of the algorithm is only 3.97 seconds, indicating the low computational cost of FRL. As a result, the curse of dimensionality is eﬀectively mitigated.

Reinforcement Learning for Infinite-Dimensional Systems

2 3 4 5 6 7 8 9 10 0

Figure 3: Learning the ﬁnite-time horizon LQR policy for the parameterized ensemble system in (3) using FRL with early stopped hierarchies shown in Algorithm 2. In particular, (a) shows the total reward (top panel) and the number of the second-order PS iterations with respect to the truncation order (hierarchy level) N ranging from 2 to 20, and (b) plots the learned policy from each hierarchy of FRL.

To further demonstrate the advantages of FRL, we will show that it also resolves the convergence issue caused by applying classical RL algorithms to sampled parameterized systems, as discussed in Section 2.1. To this end, we revisit the inﬁnite-time horizon LQR problem presented there, that is, the same system in (3) with the state-value function given by V (xt) = R 1 1 R 0 e 2.5t x2(t, β) + u2(t) dtdβ. In this case, we use Algorithm 1 with the standard value iteration applied to each hierarchy, and the simulation results are shown in Figure 4. A comparison between Figures 4a and 1b reveals that now the projection error for both the learned value functions and optimal policies converge to 0. This means that the sequences of value functions and optimal policies generated by FRL in Algorithm 1 are Cauchy sequences, and hence necessarily converge to those of the parameterized system in (3) (Rudin, 1976). As further veriﬁcation, Figure 4b illustrates that the learned value functions and optimal policies stabilize in the corresponding shadowed regions.

Zhang and Li

5 10 15 20 0

Projection error

Figure 4: FRL resolves the convergence issue caused by applying classical RL algorithms to sampled parameterized systems. Algorithm 1 with the standard value iteration applied to each hierarchy is used to learn the inﬁnite-time horizon LQR policy and value function of the parameterized system in (3). Speciﬁcally, (a) plots the projection error of the learned value functions and optimal policies, and (b) shows the learned value function (top) optimal policy (bottom) at each hierarchy.

4.2.2 Filtrated reinforcement learning for parameterized nonlinear ensemble systems

In this section, we apply FRL to learn the optimal policy for a parameterized nonlinear system arising from quantum mechanics and quantum control. The policy learning problem here is the robust excitation of a nuclear spin sample, typically consisting of as many spins as the order of Avogadro s number ( 1023). This problem, also known as the pulse design problem, is crucial in quantum science and technology. For example, it enables all the applications of nuclear magnetic resonance (NMR) spectroscopy, including magnetic resonance imaging (MRI), quantum computing, quantum optics, and quantum information processing (Li and Khaneja, 2009; Li et al., 2022; Silver et al., 1985; Roos and Moelmer, 2004; Stefanatos and Li, 2011; Chen et al., 2011; Stefanatos and Li, 2014). The dynamics of nuclear spins immersed in a static magnetic ﬁeld, with respect to the rotating frame, is governed by the Bloch equation

d dtx(t, β) =

0 0 βu(t) 0 0 βv(t) βu(t) βv(t) 0

x(t, β), (26)

which was derived by the Swiss-American physicist Felix Bloch in 1946 (Cavanagh et al., 2010). In this system, the state variable x(t, β) denotes the bulk magnetization of spins, u(t) and v(t) represent the external radio frequency (rf) ﬁelds, and the system parameter β Ω= [1 δ, 1 + δ] with 0 < δ < 1 is referred to as the rf inhomogeneity, characterizing the phenomenon that spins in diﬀerent positions of the sample receive diﬀerent strength of the rf ﬁelds. In practice, the inhomogeneity can be up to δ = 40% of the strength of the applied rf ﬁelds (Nishimura et al., 2001). A typical policy learning task is to design the rf ﬁelds u(t) and v(t), with minimum energy, to steer the parameterized Bloch system in (26) from the equilibrium state x0(β) = (0, 0, 1) to the excited state x F (β) = (1, 0, 0) for all β. We formulate

Reinforcement Learning for Infinite-Dimensional Systems

this task as an RL problem over the function space L2(Ω, R3), i.e., x(t, ) L2(Ω, R3), with the state-value function deﬁned by V (t, xt) = R T t u2(t)+v2(t) dt+ R 1+δ 1 δ |x(T, β) x F (β)|2dβ, where | | denotes a norm on R3. We still use the Chebyshev polynomial basis {Φk}k N to deﬁne the moment transform for kernelizing the Bloch system and the state-value function. To guarantee the orthonomality of {Φk}k N, we ﬁrst rescale the range of β from [1 δ, 1+δ] to [ 1, 1] by the linear transformation β 7 η = (β 1)/δ, and then apply the moment kernel transform deﬁned in (6) (see Appendix D.2 for the detailed derivation). We then apply Algorithm 2 to learn the optimal policies for the spin system in (26). In the simulation, we considered the maximal rf inhomogeneity δ = 40% encountered in practice and pick the ﬁnal time to be T = 1. Similar to the previous example, we varied the moment truncation order N from 2 to 10; for each N we approximated the evolution of the truncated moment kernelized system using the sample moments computed from the measurement data for 500 systems in the ensemble with the system parameters uniformly sampled from [0.6, 1.4]. The results are shown in Figure 5. In particular, Figure 5a displays the total reward (top panel) and number of second-order policy search iterations (bottom panel) with respect to the moment truncation order N, that is, the hierarchy level, from which we observe their convergence at N = 9. Figure 5b plots the policy and value function learned from each hierarchy, which stabilize in the corresponding shadowed regions as the truncation order increases. The computational time is 42.30 seconds, which is much longer than that (3.97 seconds) for the LQR problem presented in Section 4.2.1. In part, this is because the order-N truncated moment system in this case is of dimension 3N, 3 times higher than the moment system in the LQR example. Additionally, the nonlinearity of the Bloch system increases the complexity of this policy learning task.

2 3 4 5 6 7 8 9 10 0

Figure 5: Policy learning for robust excitation of the nuclear spin system in (26) in the presence of rf inhomogeneity using FRL shown in Algorithm 2. In particular, (a) shows the total reward (top panel) and number of the second-order PS iterations with respect to the moment truncation order, that is, the hierarchy level, N ranging from 2 to 10, and (b) plots the policies learned from each hierarchy of the algorithm.

As mentioned previously, from the perspective of quantum physics, a goal of controlling the parameterized Bloch system is to steer the spins from the equilibrium state (0, 0, 1) to the excited state (1, 0, 0) uniformly regardless of the rf inhomogeneity. Note that because

Zhang and Li

0 0 0 0 0 1 0 1 0

0 0 1 0 0 0 1 0 0

in the Bloch system d dtx(t, β) = β[u(t)Ωy +

v(t)Ωx]x(t, β) are skew-symmetric matrices, we have d dt|x(t, β)|2 = d dt x (t, β)x(t, β) = d dtx (t, β) x(t, β) + x (t, β) d dtx(t, β) = x (t, β)(βuΩ y + vΩ x)x(t, β) + x (t, β)(βuΩy + vΩx)x(t, β) = 0. As a result, starting from x(0, β) = (0, 0, 1), |x(t, β)| = 1 holds for all t and β, meaning the system trajectory stays on the unit sphere S2 = {x R3 : |x| = 1}. To evaluate the excitation performance, it then suﬃces to examine x1(T, β) solely, the ﬁrst component of the ﬁnal state x(T, β), which is plotted in the top panel of Figure 6a as a function of β. Its average value, generally used as the measure of the control performance (Zhang and Li, 2015), is 1 2δ R 1+δ 1 δ x1(T, β)dβ = 0.9613, which is close to 1, indicating the good excitation performance of the learned policies. The bottom panel of Figure 6a shows the performance measure versus time. Figure 6b shows the entire trajectory of the Bloch system on S2, from which we observe that the learned policies indeed steer the system towards the excited state (1, 0, 0) , regardless of the rf inhomogeneity, as desired.

0.6 0.8 1 1.2 1.4 0

0 0.2 0.4 0.6 0.8 1 -0.5

Performance

Figure 6: Excitation performance of the policies learned by using Algorithm 2. In particular, (a) plots the ﬁrst component of the ﬁnal state of the parameterized Bloch system in (26) with respect to the rf inhomogeneity β, and (b) plots the entire trajectory of the system driven by the learned policies on the unit sphere.

4.2.3 Comparison with baseline deep reinforcement learning models

We conduct a comparative analysis between the proposed FRL architecture with baseline deep RL models. Given our problem formulation, where policies are deterministic and take values in continuous action spaces, we choose the Deep Deterministic Policy Gradient (DDPG) and Twin-Delayed Deep Deterministic Policy Gradient (TD3) as the baseline models. We test their performance on 500 systems in the parametrized linear ensemble in (3) as well as 500 systems in the parametrized Bloch ensemble in (26), with their system parameters randomly sampled from [ 1, 1] and [0.6, 1.4], respectively. In table 1, we compare the minimal costs and training times of DDPG and TD3 with those of the proposed FRL. We observe that FRL outperforms the baseline deep RL models in both the linear and Bloch system

Reinforcement Learning for Infinite-Dimensional Systems

examples. Notably, FRL achieves signiﬁcant computational eﬃciency, reducing the training time compared to deep RL models by two orders of magnitude.

Linear system FRL DDPG TD3 Minimal cost 3.40 59.86 50.35 Training time (s) 3.97 1607 1575 Bloch (nonlinear) system FRL DDPG TD3 Minimal cost 2.85 28.95 32.46 Training time (s) 42.30 2807 2240

Table 1: Comparison of the proposed FRL with baseline deep RL models, DDPG and TD3. The top and bottom panels show the comparison results for the parameterized linear system in (3) and (nonlinear) Bloch system in (26), respectively.

In addition, a major drawback of the deep RL models lies in their failure to retain the geometric structure of the Bloch system. Recall that the state of the parameterized Bloch system should stay on the unit sphere S2. However, as shown in Figure 7, neither the trajectories learned from DDPG and TD3 satisfy this requirement. This indicates that both the DDPG and TD3 agents fail to learn the evolution of the Bloch system.

Figure 7: Temporal evolution of the parameterized Bloch system in (26) learned by using DDPG (a) and TD3 (b). In both (a) and (b), the bottom panels show the the learned trajectories and the top panels plot the averaged norms of the trajectories over the entire ensemble.

5. Conclusion

In this paper, we propose a novel RL architecture designed to learn optimal policies for arbitrarily large populations of intelligent agents. Our approach models such populations as parameterized control systems deﬁned on inﬁnite-dimensional function spaces. To address the inherent challenges posed by inﬁnite-dimensionality, we develop the moment kernel transform,

Zhang and Li

which maps the parameterized system and its value function to an RKHS consisting of moment sequences. This transformation results in a kernel parameterization of the RL problem. We then structure the ﬁnite-dimensional truncated moment representations of the RL problem into a ﬁltrated framework and develop a hierarchical policy learning algorithm through this RL ﬁltration. Each hierarchy in this framework corresponds to an RL problem for a ﬁnite-dimensional truncated moment kernelized system. To enhance the computational eﬃciency of the FRL algorithm, we investigate early stopping criteria for each hierarchy and establish the convergence of the early-stopped algorithm by constructing a spectral sequence. Additionally, we quantitatively analyze the computational and sample eﬃciency of the FRL. Examples are provided to demonstrate the exceptional performance and high eﬃciency of the proposed algorithm, validating its eﬀectiveness in practical applications.

Acknowledgments

We are grateful to the anonymous reviewers and the AE for their insightful comments and suggestions, which have helped improve the presentation of this paper. This work was supported by the Air Force Oﬃce of Scientiﬁc Research under award FA9550-21-1-0335.

Reinforcement Learning for Infinite-Dimensional Systems

Appendix A. Proof of Lemma 2

Let xk(t, ) and x(t, ) in F(Ω, M) be the trajectories (solutions) of the ensemble system in (4), with the initial condition x(0, ), driven by the control inputs uk and u, respectively, for all k N, i.e.,

d dtxk(t, β) = F t, β, xk(t, β), uk(t) and d dtx(t, β) = F t, β, x(t, β), u(t) .

We claim that xk(t, ) converges to x(t, ) on F(Ω, M) pointwisely as k for any t [0, T], which is equivalent to xk(t, β) x(t, β) for any β Ωon each coordinate chart of M by using a partition of unity on M (Lee, 2012). Therefore, it suﬃces to assume that the entire trajectories xk([0, T], β) M and x([0, T], β) M for all k N and each β Ωare located in a single coordinate chart of M and hence equivalently in Rn. To prove the claim, we ﬁx an arbitrary β Ωand note that

xk(t, β) x(t, β) = F t, β, xk(t, β), uk(t)) F(t, β, x(t, β), u(t)

F t, β, xk(t, β), uk(t)) F(t, β, x(t, β), u(t)

F t, β, xk(t, β), uk(t) F t, β, x(t, β), uk(t)

+ F t, β, x(t, β), uk(t) F t, β, x(t, β), u(t) ,

where the second inequality follows from the triangle inequality. Because d

dt x(t, β) xk(t, β)

satisﬁes the same inequality as above, we obtain

xk(t, β) x(t, β) F t, β, xk(t, β), uk(t) F t, β, x(t, β), uk(t)

+ F t, β, x(t, β), uk(t) F t, β, x(t, β), u(t)

C xk(t, β) x(t, β) + F t, β, x(t, β), uk(t) F t, β, x(t, β), u(t)

where the second inequality follows from the Lipschitz continuity of F in the system state variable according to Assumption S2. Moreover, as solutions of ordinary diﬀerential equations, all of xk(t, β) and x(t, β) are Lipschitz continuous, and hence absolutely continuous, then so is |xk(t, β) x(t, β)|. Together with its nonnegativity, Gronwall s inequality can be applied (Evans, 2010), yielding

xk(t, β) x(t, β) e Ct Z t

F s, β, x(s, β), uk(s) F s, β, x(s, β), u(s) ds

Because F is continuous in the control policy variable, the pointwise convergence of uk(s) to u(s) implies that of F s, β, x(s, β), uk(s) to F s, β, x(s, β), u(s) . Then, by Egoroﬀ s Theorem, there exist sequences of real numbers εk > 0 and subsets Ik of [0, T] with Lebesgue measure εk such that εk 0 and F s, β, x(s, β), uk(s) F s, β, x(s, β), u(s) uniformly

Zhang and Li

on Ic k = [0, T]\Ik. Consequently, we have

xk(t, β) x(t, β) lim k e Ct Z t

F s, β, x(s, β), uk(s) F s, β, x(s, β), u(s) ds

e Ctn lim k

F s, β, x(s, β), uk(s) F s, β, x(s, β), u(s) ds

F s, β, x(s, β), uk(s) F s, β, x(s, β), u(s) ds o

= e Ctn Z t

F s, β, x(s, β), uk(s) F s, β, x(s, β), u(s) χIc k(s)ds

F s, β, x(s, β), uk(s) F s, β, x(s, β), u(s) ds o = 0,

in which χIc k denotes the characteristic function of Ic k, i.e., χIc k(s) = 1 for s Ic k and χIc k(s) = 0 for s Ic k, the change of the limit and integral in the ﬁrst term in the summand follows from the uniform convergence of F s, β, x(s, β), uk(s) to F s, β, x(s, β), u(s) on Ic k, and the second integral converges 0 because the Lebesgue of Ik goes to 0. This then proves the claim. Now, without loss of generality, we assume that |J(u)| < , and the existence of such a u is guaranteed by Assumption C1. Then, we obtain the desired convergence

lim k J(uk) = lim k

0 r(t, xk(t, β), uk(t))dtdβ + Z

Ω K(T, xk(T, β))dβ i

0 lim k r(xk(t, β), uk(t))dtdβ + Z

Ω lim k K(xk(T, β))dβ

0 r(t, lim k xk(t, β), lim k uk(t))dtdβ + Z

Ω K(T, lim k xk(T, β))dβ

0 r(t, x(t, β), u(t))dtdβ + Z

Ω K(T, x(T, β))dβ = J(u),

where the second and third equalities follow from the dominated convergence theorem and the continuity of r and K, respectively (Folland, 2013).

Appendix B. Proof of Proposition 8

Fix z, z M and t, t [0, T], then by the deﬁnition of the inﬁmum, for any ε > 0, there exists an ensemble control policy u U such that

V ( t, z) + ε Z T

t r(τ, m(τ), u(τ))dτ + K(T, m(T))

t r(τ, x(τ, β), u(τ))dτdβ + Z

Ω K(T, x(T, β))dβ,

where m(τ) satisﬁes the moment system d dτ m(τ) = F(τ, m(τ), u(τ)) with the initial condition m(s) = z and x(t, ) is the trajectory of the associated ensmeble system. Next, let m(τ) be the

Reinforcement Learning for Infinite-Dimensional Systems

trajectory of the moment system steered by the same control policy with the initial condition m(t) = z and x(t, ) be the associated ensemble trajectory. Without loss of generality, we assume that t t, and then we have

V (t, z) V ( t, z) Z

t r(τ, x(τ, β), u(τ))dτdβ + Z

Ω K(T, x(T, β))dβ

t r(τ, x(τ, β), u(τ))dτdβ Z

Ω K(T, x(T, β))dβ + ε

t r(τ, x(τ, β), u(τ))dτdβ + Z

h r(τ, x(τ, β), u(τ))

r(τ, x(τ, β), u(τ)) i dτdβ + Z

h K(T, x(T, β)) K(T, x(T, β)) i dβ + ε.

Because ﬁnite-time solutions of ordinary diﬀerential equations with initial conditions in a bounded set remain bounded, |r(τ, x(τ, β), u(τ))| M holds for some M over τ [t, t] so that the ﬁrst term in (27) can be bounded as

t r(τ, x(τ, β), u(τ))dτdβ Z

t |r(τ, x(τ, β), u(τ))|dτdβM Vol(Ω)|t t|.

Then, by using the Lipschtiz continuity of r and K as in Assumption C2, the second and third terms in (27) satisﬁes Z

h r(τ, x(τ, β), u(τ)) r(τ, x(τ, β), u(τ)) i dτdβ + Z

h K(T, x(T, β)) K(T, x(T, β)) i dβ

t Cr x(τ, β) x(τ, β) dτdβ + Z

Ω CK x(T, β) x(T, β) dβ

Ω (TCr + CK)C x( t, β) x( t, β) dβ

(TCr + CK)C Z

h x(t, β) x( t, β) + x(t, β) x( t, β) i dβ

(TCr + CK)C h Z

x(t, β) x( t, β) dβ + Z

x(t, β) x( t, β) dβ i , (28)

where the second inequality follows from Gronwall s inequality. Note that the ﬁrst integral in (28) is exactly the L1-norm of x(t, ) x(t, ), which is equal to z z , the norm of the associated moment sequences, since the moment transformation is an isometry. For the second term, we use the Lipschitz continuity of solutions to ordinary diﬀerential equations to conclude R

Ω x(t, β) x( t, β) C Vol(Ω)|t t| for some C > 0. Now, let C = max{(TCr+CK)C , MVol(Ω), Vol(Ω)}, then we obtain V (t, z) V ( t, z) C(|t t|+ z z ) since ε > 0 is arbitrary. The same argument with the roles of (t, z) and ( t, z) reversed implies

|V (t, z) V ( t, z)| C(|t t| + z z ),

giving the Lipschitz continuity of V on [0, T] M as desired.

Zhang and Li

Appendix C. Derivation of Second-Order Policy Search Update Equations

Quadratic approximation of Hamilton-Jacobi-Bellman equations. We pick a nominal policy u(t), which generates a nominal trajectory m(t) by applying u(t) to the moment system in (8). Then, the optimal policy and trajectory can be represented as u (t) = u(t)+δu(t) and m (t) = m(t)+δm(t), respectively, plugging which into the Hamiltonian-Jacobi-Bellman equation in (14), i.e.,

t V (t, z) + min a U r(t, z, a) + DV (t, z), F(t, z, a) = 0

t V (t, m + δm) + min δu r(t, m + δm, u + δu)

+ DV (t, m + δm), F(t, m + δm, u + δu) = 0,

where, and in the following as well, we drop the time argument t from the state and control variables for the conciseness of the representation. Now, we assume that the value function V : [0, T] M R is smooth enough, at least in the region containing the nominal and optimal trajectories, to admit a second-order power series expansion as

V (t, m + δm) = V (t, m) + DV (t, m), δm + 1

2 D2V (t, m) δm, δm + o(δm2)

= V (t, m) + δV (t, m) + DV (t, m), δm + 1

2 D2V (t, m) δm, δm + o(δm2), (29)

where we further expand V (t, m) by using the nominal cost V (t, m) = R T t r(s, m(s), u(s))ds+ K(T, m(T)), that is, the cost obtained by applying the nominal control input u to the system starting from m(t) at time t. To explain the second-order term in the above expansion, by the deﬁnition, the second derivative D2V (t, m) of the real-valued function V (t, ) : M R evaluated at m is a bounded linear map from M to M (Lang, 1999). In our notation, D2V (t, m) δm denotes the evaluation of D2V (t, m) at δm, giving an element in M that can be paired with δm M. Conceptually, D2V (t, m) is nothing but the inﬁnite-dimensional Hessian matrix with the (i, j)-entry given by 2V zi zj |(t, m). If δm is small enough to ensure a suﬃciently accurate approximation of the value function V (t, m) up to the second-order terms, then we integrate (29) with the o(δm2) term neglected into the Hamilton-Jacobi-Bellman equation, yielding

t V (t, m) +

tδV (t, m) + D

t DV (t, m), δm E + 1

t D2V (t, m) δm, δm E

n H(t, m + δm, u + δu, DV (t, m)) + D D2V (t, m) δm, F(t, m + δm, u + δu) Eo = 0

with the system Hamiltonian H(t, m, u, DV ) = r(t, m, u) + DV (t, m), F(t, m, u) , which is the key equation to the development of the algorithm for successively improving the nominal control policy u.

Reinforcement Learning for Infinite-Dimensional Systems

Second-order policy search. To initialize the algorithm, we start with a nominal control policy u0, applying which to the moment system generates a nominal trajectory m(0). Because the pair (m(0), u(0)) satisﬁes the system, the variation of the trajectory is δm = 0. In this case, the second-order expanded Hamilton-Jacobi-Bellman equation in (30) takes the form

t V (t, m(0)) +

tδV (t, m(0)) + min δu H(t, m(0), u(0) + δu, DV (t, m(0))) = 0, (31)

and in many cases, the minimization of the Hamiltonian can be solved analytically, giving a new control policy u(0) , represented in terms of m(0) and DV (t, m(0)) as a feedback control. However, steered by this policy, the system trajectory may not be the nominal one anymore, and hence a variation δm is introduced to the nominal trajectory as m(0) + δm. Correspondingly, the second-order expanded Hamilton-Jacobi-Bellman equation becomes the one in (30) with u replaced by u(0) , in which the minimization is taken over δu for the function

H(t, m(0) + δm, u(0) + δu, DV (t, m(0))) + D D2V (t, m(0)) δm, F(t, m(0) + δm, u(0) + δu) E .

We further expand this function around (m(0), u(0) ) up to the second-order terms, yielding

u ,δu E + D DH + D2V F, δm E + D DH

u D2V δm, δu E

u2 δu, δu E + 1

D D2H + D F D2V + D2V DF δm, δm E , (32)

where the terms involving H, F, and V are evaluated at (t, m(0), u(0) , DV (t, m(0))), (t, m(0), u(0) ), and (t, m(0)), respectively, and denotes the dual of a linear operator. For example, because F(t, m0, u 0) M by identifying the tangent space of M at m0 with M itself, D F(t, m(0), u(0) ) : M M is a linear map, then its dual operator is deﬁned as a linear map D F (t, m0, u 0) : M M satisfying L, D F(t, m(0), u(0) ) z = D F (t, m(0), u(0) ) L, z for any z M and L M (Yosida, 1980). Conceptually, dual operators are nothing but transpose matrices. Next, to minimize the function in (32), the necessary condition is the vanishing of its derivative with respect to δu, giving

u D2V δm = 0, (33)

in which it is necessary that H

u |(t,m(0),u(0) ,DV (t,m(0))) = 0 since u minimizes the Hamiltonian

H(t, m(0), u(0) , DV (t, m(0))) by our choice. Recall that our intention is to approximate the Hamiton-Jacobi-Bellman equation up to the second-order term in δm. Therefore, it is required that δm and δu are in the same order, meaning, they satisfy a linear relationship; otherwise, say δu is quadratic in δm, then the terms D DH

u D2V δm, δu E and 1

u2 δu, δu E in (32) are of orders higher than

δm2. Formally, there is a linear map A : M Rm such that δu = A δm. To ﬁnd A, we replace δu by A δm in the necessary optimality condition in (33), leading to

Zhang and Li

With this choice of A, the function in (32) becomes

H + D DH+D2V F, δm E

D D2H + D F D2V + D2V DF A D2H A δm, δm E

so that the second-order expansion of the Hamilton-Jacobi-Bellman equation in (30) takes the form

t DV, δm E + 1

t D2V (t, m) δm, δm E + H + D DH + D2V F, δm E

D D2H + D F D2V + D2V DF A 2H

u2 A δm, δm E = 0, (34)

Because δm is arbitrary, the coeﬃcient of each order of δm must be 0, which transforms (34) into a system of three partial diﬀerential equations

t DV DH D2V F = 0

t D2V D2H D F D2V D2V DF + A 2H

with terms involving H, F, and V are evaluated at (t, m(0), u(0) , DV (t, m(0))), (t, m(0), u(0) ), and (t, m(0)), respectively, as before. Integrating with the chain rule as

d dt( V + δV ) =

tδV + DV, F(t, m(0), u(0))

t DV + D2V F(t, m(0), u(0))

2D3V F(t, m(0), u(0)) + 1

2 F(t, m(0), u(0)) D3V

then gives three ordinary diﬀerential equations

d dtδV (t, m(0)(t)) = H(t, m(0), u(0), DV (t, m(0))) H(t, m(0), u(0) , DV (t, m(0))), (35)

d dt DV (t, m(0)(t)) = DH D2V ( F(t, m(0), u(0) ) F(t, m(0), u(0))), (36)

d dt D2V (t, m(0)(t)) = D2H D F D2V D2V D F

where we use d dt V (t, m(0)(t)) = d dt R T t r(s, m(0)(s), u(0)(s))ds + K(T, m(0)(T)) = r(t,

m(0)(s), u(0)(s)) and A = 2H

u D2V , and omit the third-order

Reinforcement Learning for Infinite-Dimensional Systems

terms involving D3V . Moreover, because the value function satisﬁes V (T, m(0)(T)) = K(T, m(0)(T)), we have the terminal conditions for the ordinary diﬀerential equations in (35) to (36) as δV (T, m(0)(T)) = 0, DV (T, m(0)(T)) = D K(T, m(0)(T)), and D2V (T, m(0)(T)) = D2 K(T, m(0)(T)). In particular, the data DV (t, m(0)(t)) obtained from solving the above systems of diﬀerential equations is then used to compute the control policy u(0) , which has been represented in terms of m(t) and DV (t, m(0)(t)) when minimizing the Hamitonian in (31) and hence gives rise to an improvement of the nominal control policy u(0). This in turn completes one iteration of the proposed policy search algorithm.

Appendix D. Derivation of Moment Systems

D.1 Inﬁnite-dimensional LQR

In the following, we consider the ﬁnite-time horizion LQR problem

dtx(t, β) = βx(t, β) + u(t),

V (t, xt) = R

Ω h R T 0 x2(t, β) + u2(t) dtdβ + x2(T, β) i dβ, (38)

where Ω= [ 1, 1] and xt L2(Ω), the space of real-valued square-integrable functions deﬁned on Ω.

Moment kernelization. We pick {Φk}k N to be the set of Chebyshev polynomials, and by using the recursive relation of Chebyshev polynomials 2Φk = Φk 1 + Φk 1, we have

d dtmk(t) = d

1 Φk(β)x(t, β)dβ = Z 1

dtx(t, β)dβ = Z 1

1 Φk(β) βx(t, β) + u(t) dβ

Φk 1(β) + Φk+1(β) x(t, β)dβ + Z 1

1 Φk(β)dβ u(t)

2 mk 1(t) + mk+1(t) + bku(t),

where the change of the integral and time derivative follows from the dominanted convergence theorem (Folland, 2013), Φ 1, and hence m 1, are deﬁned to be identically 0, and bk is given by ( 1)k+1

1 k2 for k = 1 and 0 otherwise. We further let L : ℓ2 ℓ2 and R : ℓ2 ℓ2

denote the left and right shift operators, given by, (m0(t), m1(t), . . . ) 7 (m1(t), m2(t), . . . ) and (m0(t), m1(t), . . . ) 7 (0, m0(t), . . . ), respectively, then the moment system associated with the linear ensemble system in (38) is a linear system evolving on ℓ2 in the form d dtm(t) = Am(t) + Bu(t) with A = 1

2(L + R) and B ℓ2 whose kth component is bk. On the other hand, to parameterize the cost functional, we note that the moment transformation, that is, the Fourier transform, is a unitary operator from L2(Ω) to ℓ2, as a result of which J(u) = R 1 0 m(t) 2+2u2(t) dt+ m(T) 2 in the moment parameterization, where denotes the ℓ2 norm. In summary, the LQR problem in (38) in the moment kernel parameterization has the form ( d

dtm(t) = Am(t) + Bu(t),

V (t, m(T)) = R 1 0 h m(t) 2 + 2u2(t) i dt + m(T) 2. (39)

Zhang and Li

Second-order policy search. The Hamitonian H : ℓ2 R ℓ2 R of the moment system in (39) is given by (z, a, p) 7 z 2 + 2a2 + p, Az + Ba , in which , is the ℓ2

inner product. Let V : [0, T] ℓ2 R be the value function, then along a trajectory m(t) of the system, by setting H

a |(m(t),a,DV (t,m(t))) = 4a + DV (m(t)), B = 0, we obtain u (t) = 1

4 DV (t, m(t)), B . Consequently, the diﬀerential equations in (35) to (37) for the policy improvement algorithm read

d dtδV (t, m(t)) = h u + 1

4 DV, B i2 ,

d dt DV (t, m(t)) = 2m A DV D2V u 1

d dt D2V (t, m(t)) = 2I A D2V D2V A + 1

4D2V B B D2V,

where I denotes the identity operator on ℓ2 and we use the fact that D2V is a self-adjoint operator on ℓ2. More concretely, when applying Algorithm 2 to a truncated moment system, say of truncation order N, then A and B in the above system of diﬀerential equations are replaced by

1 0 ... ... ... 1 1 0

R(N+1) (N+1) and b BN =

b0 b1 b2 ... b N

respectively, and the operator dual is essentially the matrix transpose.

D.2 Moment kernelization of nuclear spin systems

The policy learning problem for the nuclear spin systems in (26) is given by ( d dtx(t, β) = β u(t)Ωy + v(t)Ωx x(t, β), V (t, xt) = R T t u2(t) + v2(t) dt + R 1+δ 1 δ |x(T, β) x F (β)|2dβ,

where the Bloch system is deﬁned on L2(Ω, R3) with Ω= [1 δ, 1 + δ] for some 0 < δ < 1.

Moment kernelization. Similar to the LQR case presented above, we still deﬁne the moments by using the set of Chebyshev polynomials {Φk}k N. However, in order to fully utilize the orthonormal property of Chebyshev polynomials, which only holds on Ω = [ 1, 1], we consider the transformation ψ : Ω Ω , given by β 7 (β 1)/δ, and deﬁned the moments by

Ω Φk ψ xtdλ = Z

Ω Φ xt ψ 1dψ#λ,

where λ denotes the Lebesgue measure on R, and ψ#λ is the pushforward measure of λ by ψ, that is, ψ#λ(I) = λ(ψ 1(I)) = δλ(I) is satisﬁed for I R, equivalently dψ#λ = δdλ. This directly implies that the cost functional in the moment parameterization takes the

Reinforcement Learning for Infinite-Dimensional Systems

form J(u, v) = R T 0 (u2(t) + v2(t))dt + m(T) 2, where denotes the ℓ2-norm on the R3-valued sequences, given by, m(T) 2 = P k=0 |mk(T)|2. Next, we compute the moment parameterization of the Bloch system in (26) as follows

d dtmk(t) = d

1 Φk(η) x(t, ψ 1(η))dψ#λ(η) = Z 1

dtx(t, ψ 1(η))dψ#λ(η)

1 Φk(η)(δη + 1) u(t)Ωy + v(t)Ωx x(t, ψ 1(η))dψ#λ(η)

= u(t)Ωy + v(t)Ωx nδ

2 mk 1(t) + mk+1(t) + mk(t) o ,

where we use the recursive relation satisﬁed by Chebyshev polynomials, i.e., 2Φk = Φk 1 + Φk+1, together with Φ 1 = 0. Let L and R be the left and right shift operators for real-valued sequences as introduced in Section 4.2.1, the moment parameterization of the Bloch ensemble system is given by d

dtm(t) = h δ 2(R + L) + I i u(t)Ωy + v(t)Ωx m(t), where I denotes the identity operator for real-valued sequences and be the tensor product of linear operators. As a result, we have obtained the moment kernel parameterization of the RL problem for the Bloch system as

( d dtm(t) = u(t)By + v(t)Bx m(t), V (t, m(t)) = R T 0 u2(t) + v2(t) dt + m(T) m F 2 (40)

where we deﬁne Bx = δ(R + L)/2 + I Ωx and By = δ(R + L)/2 + I Ωy to simplify the notations, and m0 and m F are the moment sequences of the constant functions x0(β) = (0, 0, 1) and x F (β) = (1, 0, 0) , respectively.

Second-order policy search. The Hamiltonian of the moment system in (40) is given by H : (ℓ2)3 R R (ℓ2)3 R, (z, a, b, p) 7 a2+b2+ p, a By+b Bx z . Let V : [0, T] ℓ2 R be the value function, then along a moment trajectory, by setting H

a |(m(t),a,b,DV (t,m(t))) = 2a + DV, Bym(t) = 0 and H

b |(m(t),a,b,DV (t,m(t))) = 2b + DV, Bxm(t) = 0, we obtain the optimal policies u (t) = DV, Bym(t) /2 and v (t) = DV, Bym(t) /2. Integrating them into the system of diﬀerential equations in (35) to (37) for the policy improvement algorithm yields

d dtδV (t, m(t)) = (u u )2 + (v v )2,

d dt DV (t, m(t)) = DV (u By + v Bx) D2V (u u)By + (v v)Bx m,

d dt D2V (t, m(t)) = (u By + v Bx) D2V D2V (u By + v Bx),

2(DV By + m B y D2V ) (DV By + m B y D2V )

2(DV Bx + m B x D2V ) (DV Bx + m B x D2V ).

Zhang and Li

Speciﬁcally, when applying Algorithm 2 to the order N truncated problem, Bx and By are replaced by the R3(N+1) 3(N+1) block matrices

(c Bx)N = 1

0 Ωx Ωx 0 Ωx

Ωx 0 ... ... ... Ωx Ωx 0

and (c By)N = 1

0 Ωy Ωy 0 Ωy

Ωy 0 ... ... ... Ωy Ωy 0

respectively.

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140 1144, 2018.

S.V. Albrecht, F. Christianos, and L. Schäfer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024.

Vladimir I. Arnold. Ordinary Diﬀerential Equations. MIT Press, 1978.

Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob Mc Grew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2020.

Aaron Becker and Timothy Bretl. Approximate steering of a unicycle under bounded model perturbation using ensemble control. IEEE Transactions on Robotics, 28(3):580 591, 2012.

Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C. Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77 82, 2020.

R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton Legacy Library. Princeton University Press, 1961.

R. Bellman, Rand Corporation, and Karreman Mathematics Research Collection. Dynamic Programming. Rand Corporation research study. Princeton University Press, 1957.

A. Bensoussan, J. Frehse, and P. Yam. Mean Field Games and Mean Field Type Control Theory. Springer Briefs in Mathematics. Springer New York, 2013. ISBN 9781461485087.

D. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientiﬁc, 1996.

Dimitri Bertsekas. Reinforcement Learning and Optimal Control. Athena Scientiﬁc, 2019.

P. Billingsley. Probability and Measure, volume 245 of Wiley Series in Probability and Statistics. Wiley, 3 edition, 1995.

Roger W. Brockett. Finite Dimensional Linear Systems, volume 74 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 2015.

Reinforcement Learning for Infinite-Dimensional Systems

Marin Bukov, Alexandre G. R. Day, Dries Sels, Phillip Weinberg, Anatoli Polkovnikov, and Pankaj Mehta. Reinforcement learning in diﬀerent phases of quantum control. Phys. Rev. X, 8:031086, Sep 2018.

Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent Reinforcement Learning: An Overview, pages 183 221. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

René Carmona, Mathieu Laurière, and Zongjun Tan. Linear-quadratic mean-ﬁeld reinforcement learning: convergence of policy gradient methods. ar Xiv preprint ar Xiv:1910.04295, 2019.

René Carmona, Kenza Hamidouche, Mathieu Laurière, and Zongjun Tan. Policy optimization for linear-quadratic zero-sum mean-ﬁeld type games. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1038 1043, 2020.

John Cavanagh, Nicholas J. Skelton, Wayne J. Fairbrother, Mark Rance, and III Arthur G. Palmer. Protein NMR Spectroscopy: Principles and Practice. Elsevier, 2 edition, 2010.

Chunlin Chen, Daoyi Dong, Ruixing Long, Ian R. Petersen, and Herschel A. Rabitz. Samplingbased learning control of inhomogeneous quantum ensembles. Phys. Rev. A, 89:023402, Feb 2014. doi: 10.1103/Phys Rev A.89.023402.

X. Chen, E. Torrontegui, D. Stefanatos, J.-S. Li, and J. G. Muga. Optimal trajectories for eﬃcient atomic transport without ﬁnal excitation. Physical Review A, 84:043415, 2011.

Shi Nung Ching and Jason T. Ritt. Control strategies for underactuated neural ensembles driven by optogenetic stimulation. Front Neural Circuits, 7:54, 2013.

Daoyi Dong, Chunlin Chen, Hanxiong Li, and Tzyh-Jong Tarn. Quantum reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38 (5):1207 1220, 2008.

Ethan N. Evans, Marcus A. Periera, George I. Boutselis, and Evangelos A. Theodorou. Variational optimization based reinforcement learning for inﬁnite dimensional stochastic systems. In Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, pages 1231 1246. PMLR, 30 Oct 01 Nov 2020.

Lawrence C. Evans. Partial Diﬀerential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, 2nd edition, 2010.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Gerald B. Folland. Real Analysis: Modern Techniques and Their Applications, volume 40 of Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. John Wiley & Sons, 2 edition, 2013.

Zhang and Li

Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. An introduction to deep reinforcement learning. Foundations and Trends R in Machine Learning, 11(3-4):219 354, 2018.

Zuyue Fu, Zhuoran Yang, Yongxin Chen, and Zhaoran Wang. Actor-critic provably ﬁnds nash equilibria of linear-quadratic mean-ﬁeld games. In International Conference on Learning Representations, 2020.

Carles Gelada, Saurabh Kumar, Jacob Buckman, Oﬁr Nachum, and Marc G. Bellemare. Deep MDP: Learning continuous latent space models for representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170 2179. PMLR, 09 15 Jun 2019.

S. J. Glaser, T. Schulte-Herbrüggen, M. Sieveking, N. C. Nielsen O. Schedletzky, O. W. Sørensen, and C. Griesinger. Unitary control in quantum ensembles, maximizing signal intensity in coherent spectroscopy. Science, 280:421 424, 1998.

Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In Gita Sukthankar and Juan A. Rodriguez-Aguilar, editors, Autonomous Agents and Multiagent Systems, pages 66 83, Cham, 2017. Springer International Publishing. ISBN 978-3-319-71682-4.

Hans L. Hamburger. Über eine erweiterung des stieltjesschen momentenproblems. Mathematische Annalen, 82:120 164, 1920.

Hans L. Hamburger. Über eine erweiterung des stieltjesschen momentenproblems. Mathematische Annalen, 82:168 187, 1921a.

Hans L. Hamburger. Über eine erweiterung des stieltjesschen momentenproblems. Mathematische Annalen, 81:235 319, 1921b.

T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer series in statistics. Springer, 2009.

Tobias Haug, Rainer Dumke, Leong-Chuan Kwek, Christian Miniatura, and Luigi Amico. Machine-learning engineering of quantum currents. Phys. Rev. Res., 3:013034, Jan 2021.

Felix Hausdorﬀ. Momentprobleme für ein endliches intervall. Mathematische Zeitschrift, 16 (1):220 248, 1923.

Suzana Herculano-Houzel. The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. Proceedings of the National Academy of Sciences, 109(supplement_1):10661 10668, 2012.

Paulo Heredia, Hasan Ghadialy, and Shaoshuai Mou. Finite-sample analysis of distributed q-learning for multi-agent networks. In 2020 American Control Conference (ACC), pages 3511 3516, 2020.

Reinforcement Learning for Infinite-Dimensional Systems

Paulo Heredia, Jemin George, and Shaoshuai Mou. Distributed oﬄine reinforcement learning. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 4621 4626, 2022.

Paulo C. Heredia and Shaoshuai Mou. Distributed multi-agent reinforcement learning by actor-critic method. IFAC-Papers On Line, 52(20):363 368, 2019. ISSN 2405-8963. 8th IFAC Workshop on Distributed Estimation and Control in Networked Systems NECSYS 2019.

David H. Jacobson and David Q. Mayne. Diﬀerential Dynamic Programming. Modern Analytic and Computational Mehtods in Sciencen and Mathematics. American Elsevier Publishing Company, 1970.

Wei-Cheng Jiang, Vignesh Narayanan, and Jr-Shin Li. Model learning and knowledge sharing for cooperative multiagent systems in stochastic environment. IEEE Transactions on Cybernetics, 51(12):5717 5727, 2021.

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model based reinforcement learning for atari. In International Conference on Learning Representations, 2020.

Alec Koppel, Garrett Warnell, Ethan Stump, Peter Stone, and Alejandro Ribeiro. Policy evaluation in continuous mdps with eﬃcient kernelized gradient temporal diﬀerence. IEEE Transactions on Automatic Control, 66(4):1856 1863, 2021.

Huibert Kwakernaak and Raphel Sivan. Linear Optimal Control Systems. Wiley-Interscience, 1972.

Lucas Lamata. Basic protocols in quantum reinforcement learning with superconducting circuits. Scientiﬁc Reports, 7(1):1609, 2017. doi: 10.1038/s41598-017-01711-6.

Serge Lang. Fundamentals of Diﬀerential Geometry, volume 191 of Graduate Texts in Mathematics. Springer New York, NY, 1999.

Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5639 5650. PMLR, 13 18 Jul 2020.

Mathieu Laurière, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Théophile Cabannes, Georgios Piliouras, Julien P erolat, Romuald Élie, Olivier Pietquin, and Matthieu Geist. Scalable deep reinforcement learning algorithms for mean ﬁeld games. In International Conference on Machine Learning, 2022.

Ngan Le, Vidhiwar Singh Rathour, Kashu Yamazaki, Khoa Luu, and Marios Savvides. Deep reinforcement learning in computer vision: a comprehensive survey. Artiﬁcial Intelligence Review, 55(4):2733 2819, 2022.

Zhang and Li

John M. Lee. Introduction to Smooth Manifolds, volume 218 of Graduate Texts in Mathematics. Springer New York, NY, 2nd edition, 2012.

Guy Lever and Ronnie Staﬀord. Modelling policies in mdps in reproducing kernel hilbert space. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artiﬁcial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 590 598, San Diego, California, USA, 09 12 May 2015. PMLR.

Jr-Shin Li. Control of inhomogeneous ensembles, May 2006.

Jr-Shin Li. Ensemble control of ﬁnite-dimensional time-varying linear systems. IEEE Transactions on Automatic Control, 56(2):345 357, 2011.

Jr-Shin Li and Navin Khaneja. Ensemble control of bloch equations. IEEE Transactions on Automatic Control, 54(3):528 536, 2009.

Jr-Shin Li, Justin Ruths, Tsyr-Yan Yu, Haribabu Arthanari, and Gerhard Wagner. Optimal pulse design in quantum control: A uniﬁed computational method. Proceedings of the National Academy of Sciences, 108(5):1879 1884, 2011.

Jr-Shin Li, Isuru Dasanayake, and Justin Ruths. Control and synchronization of neuron ensembles. IEEE Transactions on Automatic Control, 58(8):1919 1930, 2013. doi: 10. 1109/TAC.2013.2250112.

Jr-Shin Li, Wei Zhang, and Yuan-Hung Kuan. Moment quantization of inhomogeneous spin ensembles. Annual Reviews in Control, 54:305 313, 2022. ISSN 1367-5788.

Jr-Shin Li, Yuan-Hung Kuan, and Wei Zhang. Optimal quantum control using ensemble quantization. SIAM Journal on Control and Optimization, 63(1):S107 S127, 2025.

Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In William W. Cohen and Haym Hirsh, editors, Machine Learning Proceedings 1994, pages 157 163. Morgan Kaufmann, San Francisco (CA), 1994.

Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, S. M. Ali Eslami, Daniel Hennes, Wojciech M. Czarnecki, Yuval Tassa, Shayegan Omidshaﬁei, Abbas Abdolmaleki, Noah Y. Siegel, Leonard Hasenclever, Luke Marris, Saran Tunyasuvunakool, H. Francis Song, Markus Wulfmeier, Paul Muller, Tuomas Haarnoja, Brendan Tracey, Karl Tuyls, Thore Graepel, and Nicolas Heess. From motor control to team play in simulated humanoid football. Science Robotics, 7(69):eabo0235, 2022.

Pinxin Long, Tingxiang Fan, Xinyi Liao, Wenxi Liu, Hao Zhang, and Jia Pan. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6252 6259, 2018.

Zehui Lu, Tianyu Zhou, and Shaoshuai Mou. Real-time multi-robot mission planning in cluttered environment. Robotics, 13(3), 2024.

Reinforcement Learning for Infinite-Dimensional Systems

George W. Mackey. Harmonic analysis as the exploitation of symmetry - a historical survey. Bulletin (New Series) of the American Mathematical Society., 3(1):543 698, 1980.

W.J. Marks. William j. marks. Current Treatment Options in Neurology, 7:237 243, 2005.

Maja J. Matarić. Reinforcement learning in the multi-robot domain. Autonomous Robots, 4 (1):73 83, 1997. doi: 10.1023/A:1008819414322.

David Q. Mayne. A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control, 3(1):85 95, 1966.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529 533, 2015.

I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman. The successor representation in human reinforcement learning. Nature Human Behaviour, 1(9):680 692, 2017. doi: 10.1038/s41562-017-0180-8. URL https://doi.org/ 10.1038/s41562-017-0180-8.

James R. Munkres. Topology. Prentice Hall, Incorporated, 2000.

Tenavi Nakamura-Zimmerer, Qi Gong, and Wei Kang. Adaptive deep learning for highdimensional hamilton jacobi bellman equations. SIAM Journal on Scientiﬁc Computing, 43(2):A1221 A1247, 2021.

Vignesh Narayanan, Wei Zhang, and Jr-Shin Li. Duality of ensemble systems through moment representations. IEEE Transactions on Automatic Control, pages 1 8, 2024.

Katsuyuki Nishimura, Riqiang Fu, and Timothy A. Cross. The eﬀect of rf inhomogeneity on heteronuclear dipolar recoupling in solid state nmr: Practical performance of sfam and redor. Journal of Magnetic Resonance, 152(2):227 233, 2001. ISSN 1090-7807.

B. Øksendal. Stochastic Diﬀerential Equations: An Introduction with Applications. Universitext (1979). Springer Berlin, Heidelberg, 6 edition, 2003.

Open AI, :, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning, 2019. URL https://arxiv.org/abs/1912.06680.

Barna Pásztor, Andreas Krause, and Ilija Bogunovic. Eﬃcient model-based multi-agent mean-ﬁeld reinforcement learning. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=gvc DSDYUZx.

R.K. Pathria and P.D. Beale. Statistical Mechanics. Elsevier Science, 2021.

Zhang and Li

V.I. Paulsen and M. Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2016.

I. Roos and K. Moelmer. Quantum computing with an inhomogeneously broadened ensemble of ions: Suppression of errors from detuning variations by specially adapted pulses and coherent population trapping. Physical Review A, 69:022321, 2004.

Walter Rudin. Principles of Mathematical Analysis. Mc Graw-Hill, 3 edition, 1976.

Nima Sarang and Charalambos Poullis. Tractable large-scale deep reinforcement learning. Computer Vision and Image Understanding, 232:103689, 2023. ISSN 1077-3142. doi: https: //doi.org/10.1016/j.cviu.2023.103689. URL https://www.sciencedirect.com/science/ article/pii/S1077314223000693.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604 609, 2020.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889 1897, Lille, France, 07 09 Jul 2015. PMLR.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.

Shiva Shahrokhi, Lillian Lin, Chris Ertel, Mable Wan, and Aaron T. Becker. Steering a swarm of particles using global inputs and swarm statistics. IEEE Transactions on Robotics, 34 (1):207 219, 2018.

Ali H. Shoeb. Application of machine learning to epileptic seizure onset detection and treatment, September 2009.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354 359, 2017. doi: 10.1038/nature24270.

M. S. Silver, R. I. Joseph, and D. I. Hoult. Selective spin inversion in nuclear magnetic resonance and coherent optics through an exact solution of the bloch-riccati equation. Physical Review A, 31(4):2753 2755, 1985.

Eduardo D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer New York, NY, 2nd edition, 1998.

D. Stefanatos and J.-S. Li. Minimum-time frictionless atom cooling in harmonic traps. SIAM Journal on Control and Optimization, 49:2440 2462, 2011.

Reinforcement Learning for Infinite-Dimensional Systems

D. Stefanatos and J.-S. Li. Minimum-time quantum transport with bounded trap velocity. IEEE Transactions on Automatic Control, 59(3):733 738, 2014.

Thomas J. Stieltjes. Œuvres Complètes II - Collected Papers II. Springer Collected Works in Mathematics. Springer Berlin, Heidelberg, 1993.

Jayakumar Subramanian and Aditya Mahajan. Reinforcement learning in stationary meanﬁeld games. Richland, SC, 2019. International Foundation for Autonomous Agents and Multiagent Systems.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning An Introduction. MIT Press, 2018.

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaﬀ, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina Mc Kinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350 354, 2019. doi: 10.1038/s41586-019-1724-z.

Minh Vu, Bharat Singhal, Shen Zeng, and Jr-Shin Li. Data-driven control of oscillator networks with population-level measurement. Chaos: An Interdisciplinary Journal of Nonlinear Science, 34(3):033138, 03 2024.

Scott B. Wilson. A neural network method for automatic and incremental learning applied to patient-dependent seizure detection. Clinical Neurophysiology, 116(8):1785 1795, August 2005.

Yijing Xie, Shaoshuai Mou, and Shreyas Sundaram. Communication-eﬃcient and resilient distributed q-learning. IEEE Transactions on Neural Networks and Learning Systems, 35 (3):3351 3364, 2024.

Erfu Yang and Dongbing Gu. Multiagent reinforcement learning for multi-robot systems: A survey. Technical report, tech. rep, 2004.

Lin F. Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, 2019.

Yang Yang, Li Juntao, and Peng Lingling. Multi-robot path planning based on a deep reinforcement learning dqn algorithm. CAAI Transactions on Intelligence Technology, 5 (3):177 183, 2020a.

Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean ﬁeld multi-agent reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of

Zhang and Li

Machine Learning Research, pages 5567 5576, Stockholmsmässan, Stockholm Sweden, 10 15 Jul 2018. PMLR.

Zhuoran Yang, Chi Jin, Zhaoran Wang, Mengdi Wang, and Michael I. Jordan. On function approximation in reinforcement learning: optimism in the face of large state spaces. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 20, Red Hook, NY, USA, 2020b. Curran Associates Inc.

Amir Yazdanbakhsh, Junchao Chen, and Yu Zheng. Menger: Massively large-scale distributed reinforcement learning. Neur IPS, Beyond Backpropagation Workshop, 2020, 2020. URL https://beyondbackprop.github.io/.

K osaku Yosida. Functional Analysis, volume 123 of Grundlehren der mathematischen Wissenschaften. Springer Berlin, Heidelberg, 6 edition, 1980.

Yao-Chi Yu, Wei Zhang, David O Gara, Jr-Shin Li, and Su-Hsin Chang. A moment kernel machine for clinical data mining to inform medical decision making. Scientiﬁc Reports, 13 (1):10459, 2023.

Amy Zhang, Rowan Thomas Mc Allister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021a.

Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. Springer International Publishing, Cham, 2021b.

Wei Zhang and Jr-Shin Li. Uniform and selective excitations of spin ensembles with rf inhomogeneity. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 5766 5771, 2015. doi: 10.1109/CDC.2015.7403125.

A. Zlotnik and J.-S. Li. Optimal entrainment of neural oscillator ensembles. Journal of Neural Engineering, 9(4):046015, July 2012.