# policy_gradient_with_kernel_quadrature__78e302da.pdf Published in Transactions on Machine Learning Research (02/2024) Policy Gradient with Kernel Quadrature Satoshi Hayakawa hayakawa@maths.ox.ac.uk Mathematical Institute, University of Oxford Tetsuro Morimura morimura_tetsuro@cyberagent.co.jp Cyber Agent, Inc. Reviewed on Open Review: https: // openreview. net/ forum? id= WFI9xh Jrx F Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, only on which we actually compute rewards for more efficient policy gradient iterations. We build a Gaussian process modeling of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an episodic kernel quadrature method to compress the information of sample episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as its numerical illustrations in Mu Jo Co tasks. 1 Introduction Reinforcement learning (RL) aims to learn a policy model that maximizes the cumulative average of rewards (Sutton & Barto, 2018). Policy gradient algorithms operate based on gradient ascent in the policy parameter space (Gullapalli, 1990; Williams, 1992; Schulman et al., 2017) and have greatly benefited from the latest advancements in neural network models. These algorithms have found widespread applications in domains such as robotics (Peters & Schaal, 2008), large language models (Ouyang et al., 2022), medical diagnosis (Xia et al., 2020), and many others. Despite the broad applications of RL, a significant challenge, often overlooked, is the extensive computational or monetary cost associated with reward evaluations in real-world RL scenarios. Notably, in domains like material science and fluid dynamics, physical simulators are often used for evaluation of policy decisions (Fan et al., 2020; Rajak et al., 2021). These simulations tend to be computationally intensive. For example, reward calculation using flow simulation in Fan et al. (2020) requires about 1.1 hours for each episode. Other computationally demanding tasks include Neural Architecture Search (Zoph & Le, 2017) and Ordering-Based Causal Discovery (Wang et al., 2021). Furthermore, tasks like person re-identification (Liu et al., 2019) and RL with Human Feedback (RLHF) (Ouyang et al., 2022) demand human annotation for reward computation. This human annotation process often becomes a bottleneck, emphasizing the need to minimize such instances. While recent approaches like RL with AI feedback (RLAIF) (Lee et al., 2023) show promise, querying external AI typically incurs a cost, reinforcing the desire to minimize the number of reward evaluations. Given this backdrop, our primary motivation is to alleviate the computational and monetary burdens of reward evaluations in the online RL setup. We try to accelerate policy gradient methods by employing a novel approach that efficiently selects a representative subset of episodes for reward computations. Let us start with introducing necessary notations in RL and policy gradient methods. 1.1 MDP and policy We are given a state space S, an action space A, a reward function r : S A R, and transition probability p( |s, a) over S conditioned on each state-action pair along with a distribution over initial states ρ(s); the Published in Transactions on Machine Learning Research (02/2024) tuple of these defines a Markov decision process (MDP). In the RL setting, it is typically the case that both the transition probabilities p( |s, a) and the initial state distribution ρ(s) are unknown to the agent, requiring the agent to learn an effective policy through interaction with the environment. Stochastic policy π( |s) is a probability density over A (with a canonical reference measure) conditioned on each s S. Given p and π, we can generate a Markov chain that starts from a state s0 S possibly drawn from a probability distribution associated with the MDP and continues as at π( |st) and st+1 p( |st, at) for t 0. We call such a Markov chain an episode e = (st, at)t 0; it can be of finite length T = T(e) due to some termination rule of MDP. Since p is fixed in our setting, we might abuse the notation as e π to represent the stochasticity of e. Given an episode e = (st, at)t 0, the discounted return at time t is defined as Rt(e) := P u t γu tr(su, au), where γ (0, 1) is a discount rate. When the dependency on e is apparent, we might simply denote r(e) = (rt)t 0 = (r(st, at))t 0 and Rt = P u t γu tru. We finally define the Q-function Qπ : S A R and the value function V π : S R associated with the policy π by Qπ(s, a) := Ee π[R0(e)|s0 = s, a0 = a], V π(s) := Ee π[R0(e)|s0 = s] = Ea π( |s)[Qπ(s, a)]. 1.2 Policy gradient With a parameterized stochastic policy πθ, we aim to maximize J(πθ) := Ee πθ[R0(e)] with respect to θ. Its gradient can be written as θJ(πθ) = Ee π t 0 γt Rt(e) θ log πθ(at|st) t 0 γt Aπ(st, at) θ log πθ(at|st) where Aπ(s, a) := Qπ(s, a) V π(s) is the advantage function, introduced here for variance reduction. In practice, we usually approximate the gradient by Monte Carlo integration by generating episodes e1, . . . , e N iid π: i=1 ˆG(ei), (1) t 0 γt ˆAt(e) θ log πθ(at|st), (2) where ˆAt(e) is an approximation of Aπ(st, at), which requires evaluations of r(st, at) for t 0. A typical choice for ˆA is ˆAt(e) = Rt(e) Vφ(st), where Vφ is a parametric approximation of the value function, often referred to as the baseline and updated iteratively (Williams, 1992). We refer to this method as the vanilla policy gradient (vpg). We shall denote the policy parameter by θ and all the other parameters including baseline and kernel hyperparameters by φ. In order to highlight our motivation, we assume that we can separate running an episode e and evaluating the reward r(e) (and so ˆAt(e)); this policy gradient method is summarized in Algorithm 1. Algorithm 1 Policy gradient Input: A policy πθ, advantage estimator ˆA 1: for iteration = 1, 2, . . . do 2: Generate episodes (ei)N i=1 iid πθ 3: Compute (r(ei))N i=1 and then ( ˆA(ei))N i=1 4: θ θ + αN 1 PN i=1 ˆG(ei) (α: learning rate) 5: Update ˆA by using (ei, r(ei))N i=1 6: end for Published in Transactions on Machine Learning Research (02/2024) 1.3 Contribution To speed up the usual policy gradient methods, we propose combining policy gradient algorithms and kernel quadrature for reducing episodes, particularly aiming at expensive-reward situations mentioned in Section 1. A positive definite kernel K on a domain E is a two-variable function K : E E R such that, given any points e1, . . . , en S with any n, the n n matrix (K(ei, ej))n i,j=1 is symmetric and positive semi-definite. Such a kernel gives an embedding of E into a wider space called reproducing kernel Hilbert space, and K(e, e ) is an inner product between the points e and e in that space, so a kernel over episodes basically enables us to measure the magnitude and similarity of episodes. Kernel quadrature in this setting runs as follows: given a positive definite kernel of episodes K(e, e ) over a space of episodes E, we approximate the N-point empirical measure of episodes by a weighted n-point subset with a certain criterion based on K: 1 N PN i=1 δei P i I wiδei with I {1, . . . , N} and |I| = n, where δe is the delta measure at e E, which corresponds to a one-point distribution on E almost surely taking e. The empirical measure of episodes we use is from a realistic number of samples (ei)N i=1 from the stochastic policy and the MDP, which is used in the usual policy gradient algorithms. Thus, kernel quadrature works as data compression of the empirical measure to reduce the number of reward evaluations. This process in our implementation is treated as a (black-box) function KQuad(K, (ei)N i=1), the convex kernel quadrature algorithm given by Hayakawa et al. (2022), which is briefly explained in Section 3.1. However, we can use any other efficient kernel quadrature method as the KQuad function as long as it is applicable given an empirical measure and a kernel (a small ablation study is given in Appendix C). Algorithm 2 Vanilla PGKQ Input: n N, a policy πθ, advantage estimator ˆA, and episodic kernel K 1: for iteration = 1, 2, . . . do 2: Generate episodes (ei)N i=1 iid πθ 3: I, (wi)i I KQuad(K, (ei)N i=1) with |I| = n 4: Compute (r(ei))i I and then ( ˆA(ei), ˆG(ei))i I 5: θ θ + α P i I wi ˆG(ei) 6: Update ˆA by using (wi, ei, r(ei))i I 7: Update K by using (wi, ei, r(ei))i I 8: end for The proposed algorithm is given in Algorithm 2 as policy gradient with kernel quadrature (PGKQ). See Remark 3 for the computational complexity. While the detailed explanation of kernel quadrature under an MDP is described in Section 3, our contributions are summarized as follows: We develop a theory to make available the kernel quadrature over episodes, by introducing a Gaussian process modeling of returns or rewards, which can also be combined with policy gradient relatives including PPO (Schulman et al., 2017). We compare two versions of our PGKQ method with existing policy gradients with small and large batch sizes, and see the efficiency of PGKQ in catching up with the large-batch policy gradient only by small-batch reward observations. 2 Related literature 2.1 Bayesian quadrature for policy gradient The most directly relevant to our study is the application of Bayesian quadrature (O Hagan, 1991) for estimating policy gradient (BQPG; Ghavamzadeh & Engel, 2007; Ghavamzadeh et al., 2016; Akella et al., 2021). They model the Q-function Qπ by a Gaussian process (GP; see Section 3.1 for a formal definition), and estimate the policy gradient θJ(πθ) as a posterior of a vector-valued GP based upon (noisy) observations Published in Transactions on Machine Learning Research (02/2024) of Qπ. Their main contribution from the viewpoint of gradient estimate is placing a better weight than the uniform weight in (1). Although we also use GP modeling and weighted gradient estimate, our methods are different from these existing studies in the following two points. Episodes reduction. As already mentioned in Section 1.3, our objective is to approximate a large batch of episodes by a smaller batch of weighted episodes in order to reduce the number of reward computations, while the variants of BQPG are on how to better estimate the policy gradient by a fixed batch of episodes. Flexible kernel selection. As is always the case with the use of Bayesian quadrature, we need to know the exact values of some integrals associated with the covariance kernel of the GP (e.g., Eq. (16) in Ghavamzadeh & Engel, 2007) to execute BQPG. They need to use a specific class of kernel due to this limitation, while our method is valid for any choice of kernel because of the existence of a larger empirical measure that we want to approximate. 2.2 Numerical integration in data science Estimating intractable integrals with a small number of integrand evaluations classically ranges from Monte Carlo (Metropolis & Ulam, 1949), cubature (Stroud, 1971) to QMC (Dick et al., 2013), while the recent literature also includes Bayesian/kernel quadrature (O Hagan, 1991; Chen et al., 2010; Bach, 2017), recombination (Litterer & Lyons, 2012; Tchernychova, 2016), coresets (Bachem et al., 2017), determinantal point processes (DPPs; Bardenet et al., 2020), or dataset distillation (Wang et al., 2018). It is impossible to explain each method in detail, but they all agree on approximating a (probability) distribution, which is typically a continuous distribution or large discrete data, by a small (weighted) set (i.e., a discrete measure with small support). Let us mention some existing applications of these methods towards better/faster gradient estimates in data science, stochastic gradient descents (SGDs; Robbins & Monro, 1951) in particular. DPPs have already been applied to acquire better mini-batches for gradient estimates in SGDs (Zhang et al., 2017; 2019; Bardenet et al., 2021), while there also is an application of recombination with the same motivation (Cosentino et al., 2020). Small-GAN (Sinha et al., 2020) uses coresets for a better minibatch selection when training generative adversarial networks (GANs; Goodfellow et al., 2020). Kernel quadrature, a method of seeking coreset based on kernel-based discrepancy, has also been applied to iterative updates in data science; minibatch selection in a warped Bayesian quadrature (Adachi et al., 2022) and Bayesian optimization (Adachi et al., 2023a;b), and dictionary compression in model-based RL (Chakraborty et al., 2023). Our contribution compared with these studies is on the GP modeling specific to MDPs and the resulting application of kernel quadrature to suited for efficiently estimating the policy gradient. The main difficulty in applying existing sample-efficient gradient estimators for policy gradient is that points in our setting (i.e., episodes) are not simply a vector in a fixed space: their length can differ (and be very long) and each timestep affects the behavior afterward. This is where we benefit from introducing stepwise GPs and obtaining the overall kernel for a pair of episodes in Section 3.2. Therefore, none of the existing sample reduction techniques are readily applicable to our setting, to the best of our knowledge. Our framework introduced in Section 3.2 then enables us to apply general kernel quadrature methods (not limited to KQuad from Hayakawa et al. (2022); see Appendix C) to policy gradient algorithms, but selecting the best-fit kernel quadrature from the literature is not our main objective: our primary contribution is the introduction of a general framework that connects policy gradient with kernel quadrature, particularly useful in scenarios where reward evaluation is a bottleneck, and confirm that the idea itself works with the standard benchmarks in RL. 3 Kernel quadrature for reducing episodes 3.1 Kernel quadrature in general Kernel quadrature is a way of approximating a large or continuous distribution by a small discrete distribution, where the following function space plays an essential role. Published in Transactions on Machine Learning Research (02/2024) The reproducing kernel Hilbert space (RKHS) H associated with a positive definite kernel K over a space X is the (smallest) Hilbert space of functions from X to R satisfying the following properties: For each fixed x X, the univariate function Kx such that Kx( ) = K(x, ) is contained in H. For each x, y X, the inner product between Kx and Ky satsfies Kx, Ky H = K(x, y). The first property says that the space H is basically spanned by the embeddings kx of points x S, and allows linear algebra over the abstract space X. The second property corresponds to what we wrote above: k(x, y) is an inner product between the points x and y in that space (i.e., Kx and Ky). In our context, by finding an appropriate RKHS for episodes (this is equivalent to finding a positive definite kernel over episodes), we can introduce the established kernel quadrature techniques, which enables us to find an efficient discretization of randomness given by e π. In general, for a positive definite kernel K : X X R and a probability distribution µ, kernel quadrature aims to find a good quadrature rule µn = (wi, xi)n i=1 a set of weights wi R and points xi X that makes the following the worst-case error small: wce(µn; K, µ) := sup f H 1 |µn(f) µ(f)| , where H is the reproducing kernel Hilbert space (RKHS) associated with the kernel K, µn(f) := Pn i=1 wif(xi), and µ(f) := Ex µ[f(x)]. There are many algorithms for this problem, including but not limited to a greedy algorithm called herding (Chen et al., 2010; Huszár & Duvenaud, 2012; Bach et al., 2012; Tsuji et al., 2022), weighted sampling methods (Bach, 2017; Belhadji et al., 2019; 2020; Belhadji, 2021; Epperly & Moreno, 2023), a subsampling methods called thinning (Dwivedi & Mackey, 2021; 2022; Shetty et al., 2022); see Hayakawa et al. (2022, Table 1) for a comparison of these methods. We use the convex kernel quadrature (Hayakawa et al., 2022; 2023) for its empirical competence, but our method can incorporate any kernel quadrature method feasible with a general kernel and a discrete space. Convex kernel quadrature with empirical measure. Let us briefly explain a version of the KQuad algorithm in Hayakawa et al. (2022), which we use in our implementation of PGKQ. First, we are given a positive definite kernel K on a set of points (xi)N i=1. Let u1, . . . , un 1 RN be the eigenvectors (in implementation, given by the randomized SVD (Halko et al., 2011)) corresponding to the first to (n 1)-th largest eigenvalues of the Gram matrix (K(xi, xj))N i,j=1. Then, we find a vector w RN such that w ui = 1 N 1 ui for i = 1, . . . , n 1, w 0, 1 w = 1, and |w|0 n (where 1 = (1, . . . , 1) , 0 = (0, . . . , 0) RN, and | | is the number of nonzero entries), by solving a linear programming problem (particularly using the recombination algorithm (Litterer & Lyons, 2012; Tchernychova, 2016)). The nonzero entries correspond to the points and weights used in µn, or the index set I and weights wi mentioned in Section 1.3. See the code and the original paper for the actual implementations and the theoretical guarantees. Kernels and GPs. The choice of the kernel K is essential for the applications of kernel quadrature (Briol et al., 2017), and we propose exploiting the covariance kernels of GPs of the discounted returns Rt(e) or the reward function r in Section 3.2, in order to set an appropriate kernel over the space of episodes. Formally, a real-valued GP on X is a distribution over the space of functions X R, associated with a mean function m : X R and a covariance kernel function K : X X R, denoted as GP(m, K). It satisfies that, for a random function f GP(m, K) and any x1, . . . , xn X, (f(xi))n i=1 follows a normal distribution with the mean vector (m(xi))n i=1 and the covariance matrix (K(xi, xj))n i,j=1. We write the expectation with respect to the distribution of the Gaussian process we have as EGP (e.g., EGP[f] = m), and call f GP(m, K) centered if m = 0 as a function. Before going into details of the modeling in the case of MDP, we give an important relation between kernel quadrature and GP. The following is an immediate generalization of Huszár & Duvenaud (2012, Proposition 1): Published in Transactions on Machine Learning Research (02/2024) Proposition 1. Let f GP(0, K) be a centered GP on X and µ be a probability distribution satisfying R X K(x, x) dµ(x) < . Then, for a kernel quadrature rule µn on the same space, we have EGP[(µn(f) µ(f))2] = wce(µn; K, µ)2. Note that the GP being centered is essential since otherwise an additional term of (µn(m) µ(m))2 with m = EGP[f] appears. We give its proof in Section A.1 for completeness. In the context of our methods, µ is typically given by a large batch of episodes and µn is a quadrature rule with a smaller support, with which we estimate the policy gradient. For more details on the role of this proposition, see the following sections, e.g., the description after Proposition 2. 3.2 Gaussian process modeling of MDP Recall the notations for MDPs introduced in Section 1. In particular, we are given a parametric policy π = πθ, ˆAt(e) is an approximation of the advantage Aπ(st, at), and ˆGt(e) is an episodic gradient defined by (2). Also, recall that our objective is to find a good subset of episodes before evaluating rewards. Our use of GP is based on the following heuristic: For a set of episodes (ei)N i=1 and weights (wi)i I with I {1, . . . , N}, if the uncertainty of 1 N PN i=1 P t 0 γt ˆAt(ei) P i I wi P t 0 γt ˆAt(ei) is small, then the uncertainty of 1 N PN i=1 ˆG(ei) P i I wi ˆG(ei) is also expected to be small. The uncertainty mentioned above is not a mathematical term, but can artificially be introduced as a variance of a Gaussian variable based on the GP we are working with. We assume above that reducing the GP variance of the scalar-valued quantity based on ˆA is likely to result in a reduced GP (co)variance of the gradient estimate based on ˆG, which is of the vector-valued. By looking at (2), we are assuming above that ignoring the vector-valued term θ log πθ(at|st) does not so much harm the quality of the reduced episodes. This heuristic not only simplifies the derivation of kernels into the MDP context, but also provides a unified treatment to similar policy-gradient methods such as TRPO and PPO (Schulman et al., 2015; 2017) where the loss functions are given by multiplying the advantage function and other score-related terms. Under the above heuristic, what we need to do is derive a GP for the following functional of an episode: t 0 γt ˆAt(e), (3) as we can then use kernel quadrature via Proposition 1. We want to directly model ˆF, but will start by modeling smaller components such as Rt and r for more data and flexibility. In the following, we introduce two ways of modeling ˆA with an episodic GP. These are based on a simpler base GP for either the return Rt or a reward r. Although we primarily consider the estimator ˆAt(e) = Rt(e) Vφ(st), our argument can be generalized to more complicated estimators such as Rt(e) Vφ(st) γt0 t(Rt0(e) Vφ(st0)) (for a certain t0) introduced by Mnih et al. (2016). For simplicity of notation, we shall write z = (s, a), zt = (st, at) Z := S A and e = (st, at)t 0 = (zt)t 0 for state-action pairs in the following. We present two ways of GP modeling in Sections 3.2.1 & 3.2.2, and we explain how to update the GPs (mean function, covariance kernel) over iterations in Section 3.2.3. 3.2.1 Option 1: modeling Rt with GP In this model, our base GP on Z is given by Rt|zt=z GP(Vφ, kψ), (4) Published in Transactions on Machine Learning Research (02/2024) where Vφ(z) := Vφ(s) is the baseline function in the policy gradient algorithm and kψ is a positive definite kernel on the domain Z = S A with hyperparameter ψ. It formally means, for episodes e = (zt)t 0, e = (z t)t 0 and time t, u, we have (Cov denotes covariance) EGP[Rt(e)] = Vφ(st), (5) Cov GP[Rt(e)Ru(e )] = kψ(zt, z u). (6) Our intuition behind this modeling is two-fold. First, we Qπ GP(Vφ, k0,ψ) to quantify the uncertainty of Qπ. Second, we consider Rt as an estimator of Qπ(st, at), whose variance is modeled by another independent GP, i.e., Rt|zt=z GP(Qπ(z), k1,ψ); this randomness is of a different kind from the above uncertainty, since Rt is still a random variable when fixing a stochastic policy π (or a stochastic environment), while Qπ is a deterministic function. The modeling (4) is obtained by combining these GPs (kψ = k0,ψ + k1,ψ). Given the modeling for Rt, the advantage estimator ˆAt(e) = Rt(e) Vφ(st) follows a centered GP of zt (and then the episode e) with a covariance kernel kψ and we have the following kernel for ˆF. Similar computations also apply to other modeling such as Rt(e) Vφ(st) γt0 t(Rt0(e) Vφ(st0)). Proposition 2. Suppose (Rt)t 0 (as functions of an episode) follow a GP determined by (5) and (6) with kψ being bounded. Then, the functional1 ˆF(e) = P t 0 γt( ˆAt(e)) with ˆAt(e) = Rt(e) Vφ(st) follows a centered GP with a covariance kernel K given by K(e, e ) = X t,u 0 γt+ukψ(zt, z u), (7) where e = (zt)t 0, e = (z u)u 0 are episodes. Given ˆF GP(0, K) with (7), we can apply Proposition 1 to see how kernel quadrature works. Indeed, by letting f = ˆF and µ = 1 N PN i=1 δei in Proposition 1, for a quadrature rule µn = (wi, ei)i I, we have i=1 ˆF(ei) X i I wi ˆF(ei) = wce(µn; K, µ)2. Thus, provided the heuristic in Section 3.2, we expect that a good kernel quadrature (i.e., that of a small worst-case error) gives us a good gradient estimate P i I wi ˆG(ei) while the amount of actual reward computations is kept small. The actual algorithm looks like Algorithm 2, where we first update kψ (see Section 3.2.3) and then K as (7). Remark 1. We could choose to use a parametric mean function mψ(z) instead of using Vφ(s) in (4) for more detailed modeling of Rt. However, then ˆF could not represent the sum of advantage estimators and would not necessarily follow a centered GP, so we could not use Proposition 1. One way to circumvent this issue is using a hybrid gradient estimate described in the following section. 3.2.2 Option 2: modeling r with GP The object modeled by the GP in the previous section is not static over the iterations of the policy gradient in that Rt (and Qπ) is dependent on the policy π. So the update of kψ might be inaccurate for future policies. Our second option is thus modeling an static object, the reward function r: r GP(mψ, kψ), (8) where ψ is a hyperparameter for the mean function mψ and the positive definite kernel kψ on Z = S A. 1Strictly speaking, we only justify that the sum in ˆF converges in the L2(GP) space, when with infinite horizon; see Section A.2. The same applies to Proposition 3. Published in Transactions on Machine Learning Research (02/2024) However, under the modeling (8), the estimator ˆAt = Rt Vφ(st) does not necessarily follow a centered GP because EGP[ ˆAt(e)] = P u t γu tmψ(zu) Vφ(st). As already pointed out in Remark 1, we cannot simply use Proposition 1 to run an episodic kernel quadrature in this case. Instead, we can observe the following decomposition: θJ(πθ) = Ee π t 0 γt(Rt b(st))gθ(zt) t 0 γt(Rt EGP[Rt])gθ(zt) | {z } I: centered GP term t 0 γt(EGP[Rt] b(st))gθ(zt) | {z } II: bias term where b : S R is any (integrable) baseline function and gθ(zt) := θ log πθ(at|st). This decomposition can nicely be understood by introducing a fake reward mψ (instead of r) and the corresponding fake return Rψ t (e) := P t 0 γu tmψ(zu) = EGP[Rt(e)]. Indeed, the term II is given just by replacing Rt with Rψ t in the right-hand side of (9), and the remaining term I works as a gradient modification. Let us consider the context of reducing episodes from a large batch (ei)N i=1 to a weighted small batch µn = (wi, ei)i I. Since we can compute the fake rewards/returns without access to r, we can estimate the bias term II by using all the episodes (ei)N i=1. For the term I, now that the integrands Rt EGP[Rt] are centered, we can apply Proposition 1 with the following representation of the GP for the modified functional ˆFGP := P t 0 γt(Rt EGP[Rt]). Proposition 3. Let the reward r follows a GP given by (8) with mψ and kψ being bounded. Then, the functional ˆFGP(e) = P t 0 γt(Rt(e) EGP[Rt(e)]) follows a centered GP with a covariance kernel K given by K(e, e ) = X t,u 0 (1 + t)(1 + u)γt+ukψ(zt, z u), (11) where e = (zt)t 0, e = (z u)u 0 are episodes. Let us now define ˆFb := P t 0 γt(Rt b(st)) as a generalization of ˆF in the previous sections and ˆF ψ b := P t 0 γt(Rψ t b(st)) (= ˆFb ˆFGP) be its fake counterpart. Similarly to the previous section, given episodes (ei)N i=1 and µ = 1 N PN i=1 δei, we can approximate 1 N PN i=1 ˆFb(ei) by P i I wi ˆFGP(ei)+ 1 N PN i=1 ˆF ψ b (ei), where µn = (wi, ei)i I is a quadrature rule, given by running a kernel quadrature algorithm only regarding the term I. Note that this approximation is computable only with the reward evaluations over episodes (ei)i I. By using Proposition 1, their mean squared error in terms of GP, which is actually just about the error of 1 N PN i=1 ˆFGP(ei) P i I wi ˆFGP(ei), can again be represented as wce(µn; K, µ)2, where K is given by (11) regarding the term I this time. However, the gradient estimate is not as simple as P i I wi ˆG(ei) in the previous section (or the heuristic in Section 3.2), due to the use of a non-centered GP. Under the notation of (10), let us define ˆGb(e) := X t 0 γt(Rt(e) b(st))gθ(zt), ˆGGP(e) := X t 0 γt(Rt(e) EGP[Rt])gθ(zt), ˆGψ b (e) := X t 0 γt(Rψ t (e) b(st))gθ(zt), and then ˆGψ b (= ˆGb ˆGGP) is computable without access to the true rewards. By using these notations, we replace 1 N PN i=1 ˆGb(ei) with a lighter gradient estimate P i I wi ˆGGP(ei) + 1 N PN i=1 ˆGψ b (ei). The overall algorithm is given by Algorithm 3. The way we update GP parameters is described in Section 3.2.3. Published in Transactions on Machine Learning Research (02/2024) Algorithm 3 PGKQ with non-centered GP Input: n N, a policy πθ, baseline b, GP-mean mψ and covariance kψ modeling r GP(mψ, kψ) 1: for iteration = 1, 2, . . . do 2: Generate episodes (ei)N i=1 iid πθ 3: Compute K with (11) 4: I, (wi)i I KQuad(K, (ei)N i=1) with |I| = n 5: Compute (r(ei))i I and then ( ˆGGP(ei))i I 6: θ θ + α(P i I wi ˆGGP(ei) + 1 N PN i=1 ˆGψ b (ei)) 7: Update b by using (ei)N i=1 and mψ 8: Update mψ, kψ by using (wi, ei, r(ei))i I 9: end for Remark 2. We propose using b = Vφ taken from the base policy gradient algorithm. We could update Vφ by using the weighted episodes (wi, ei)i I like the updates of mψ (see Section 3.2.3), but we actually propose updating Vφ based on fake rewards mψ, which clearly separates the roles of the policy gradient (PG) and kernel quadrature (KQ), since ˆGψ b is the usual PG-estimate given the fake rewards. Indeed, the KQ side then receives episodes (ei)N i=1 and outputs a smaller batch of weighted episodes (wi, ei)i I, while the PG side runs a usual PG with all the episodes (ei)N i=1 associated with the fake reward mψ, with a modification of its gradient estimate by adding P i I wi ˆGGP(ei). By following this formulation, we can not only use ˆGb (or ˆGψ b ) but also incorporate any gradient estimator (in the PG-side for the fake-reward MDP) in Algorithm 3. See also Appendix B.1 for more implementation details, including how we combine PGKQ with PPO. 3.2.3 Updating GP networks Let us describe how we update kψ (and mψ) during the iterations of PGKQ. Updating kψ. Although we should ideally update the GP based on the Bayes rule in a purely Bayesian setting, our GPs have to treat either a non-static target function (like Rt) or too many data points since each timestep in the MDP is a data point for the GP. Following the maximum likelihood estimation (MLE) and the deep kernel learning (Wilson et al., 2016) framework, we set the loss function for kψ as Lker(kψ) = y kψ(z, z)y + log det kψ(z, z) (12) and conduct the gradient descent over the iterations, where z is given by arranging data points zt = (st, at) from all the episodes e (ei)i I, which are after the reduction of kernel quadrature, and y is the corresponding observed values, such as Rt Vφ(st) in Option 1 and r(zt) mψ(zt) in Option 2. In practice, we conduct the gradient descent by splitting (z, y) into minibatches to avoid computing the full kψ(z, z). Updating mψ for non-centered GPs. We can either pass the optimization of mψ into the deep kernel learning by using the same loss in (12), or just independently learn mψ by least-squares as we do in practice. For the least-squares learning of mψ, we can make use of the quadrature measure µn to incorporate the episode weights as Lmean = Ee µn t 0 (1 + t)γt(rt mψ(zt))2 where the discount factor (1 + t)γt comes from (11). Remark 3 (Computational complexity). Let us consider the overall complexity of the kernel quadrature with kernel learning part in PGKQ. In both options, the computation of the single K(e, e ) requires T 2-times computation of kψ (T is the episode length), and so, by combining with the kernel quadrature algorithm in Hayakawa et al. (2022) (where we use all the episodes for Nyström s method), it takes Published in Transactions on Machine Learning Research (02/2024) Hopper-v4 Half Cheetah-v4 Walker2d-v4 reward (vpg) reward (ppo) iteration iteration Inverted Double Pendulum-v4 Figure 1: Learning curves in Mujoco tasks: reward vs iteration (comparing {vpg, ppo} (blue), {vpg, ppo}-KQ (orange), {vpg, ppo}-KQ-no-mean (green), and {vpg, ppo}-large (red)) O N 2(log n + T 2) + n2N log(N/n) to get an n-episode quadrature from an N-episode sample. If we set the batch size to M in the kernel learning, then the kernel update takes O M 3 in each iteration and has O(n T/M) iterations. Thus, the overall complexity for a single iteration of the PGKQ algorithm (for kernel quadrature and kernel learning) is given by O N 2(log n + T 2) + n2N log(N/n) + n TM 2 . 4 Numerical experiments To demonstrate the effectiveness of our proposed methods, we conducted experiments on Mu Jo Co tasks since they are widely recognized as standard benchmarks in RL, even though the reward calculation for them is lightweight. We evaluated PGKQ using two options, both designed to reduce a large batch of N episodes to a smaller batch of n episodes. We adopted two conventional PG methods, vpg and PPO, as the bases for PGKQ. We also compared these methods without kernel quadrature, using either small n or large N batch sizes. Throughout this section, we maintained N = 64 and n = 8. Specifically, we compared the following algorithms: {vpg, ppo}: Corresponding algorithm with n episodes per iteration. {vpg, ppo}-KQ: Corresponding algorithm with PGKQ by modeling r (Option 2). {vpg, ppo}-KQ-no-mean: Corresponding algorithm with PGKQ by modeling Rt (Option 1). {vpg, ppo}-large: Corresponding algorithm with N episodes per iteration. In each experiment (either based on vpg or ppo), for each of the four algorithms, we ran a 500-iteration optimization (whose one iteration is based on observing n or N episodes, depending on the algorithm) ten times for statistical purposes. In all the figures, the empirical average of the total reward P t 0 r(st, at) (which is actually a finite sum in all the experiments) with its standard error (shaded region) are shown. All the experiments were conducted by using Py Torch (Paszke et al., 2019) and Adam (Kingma & Ba, 2015). Our motivation for conducting these experiments is as follows: How well can we catch up with the learning curve of {vpg, ppo}-large by using PGKQ rather than the small-batch {vpg, ppo}? So we compared the rewards against iterations in this section, but we also compare rewards against steps in Appendix B.2. In all the experiments, we used three-layer fully connected Re LU neural networks (NNs) for each of mψ and kψ, where kψ(z, z ) was computed by passing the NN-embeddings of state-action pairs z and z to the Published in Transactions on Machine Learning Research (02/2024) Gaussian kernel with additional scale and noise parameters. See the end of this section and Appendix B.1 for implementation details. Mu Jo Co tasks. We used Mu Jo Co (v2.1.0, Todorov et al., 2012) with the Gymnasium API (Towers et al., 2023)2. We compared all the aforementioned algorithms in the four following tasks: Inverted Double Pendulum-v4, Hopper-v4, Half Cheetah-v4, Walker2d-v4. In all the tasks, the maximum episode length is 1000, where the tasks except Half Cheetah-v4 can terminate earlier when the state enters a predefined unhealthy region. In these tasks, for the vpg (vanilla policy gradient) and ppo (proximal policy optimization with clipping, Schulman et al., 2017), we used the implementation of the machina3 library. The learning rates of the policy, baseline, and GP-related networks were all set to 3 10 4. This value was the default value of the machina library, but also chosen as the most successful learning rate when comparing several different learning rates with the 300-iteration, n-episode ppo in the Hopper-v4 task. The discount rate was γ = 0.995. The results are shown in Figure 1. In most tasks and base algorithms {vpg, ppo}, the order of total reward followed {vpg, ppo}-large > {vpg, ppo}-KQ > {vpg, ppo}-KQ-no-mean > {vpg, ppo}. In particular, in the experiments with vpg, PGKQ methods often get out of the plateau where the vpg is stuck. Network architecture. For the policy and baseline networks, we used the default implementations of the library (machina). Let us explain the part with our original implementation, mψ and kψ in the GP modeling. (i) Inputs: In all the Mu Jo Co tasks, we simply concatenated two vectors st and at to make zt, and used it as inputs to our networks. Let us denote by D the dimension of zt in the following. (ii) Implementation of kψ: In both experiments, we implemented kψ as follows kψ(zt, z u) = exp λψ fψ(zt) fψ(z u) 2 + (10 5 + exp(σψ))δ(zt, z u), where is the Euclidean norm, δ(zt, z u) equals 1 if zt = z u and 0 otherwise, λψ, σψ R are respectively the scale and noise parameters, and fψ : RD R10 is the embedding function explained in the following. For fψ in all the Mu Jo Co tasks, we used a fully connected neural network with two hidden layers with D (so RD RD RD R10) and the Re LU activations except for the final layer. (iii) Implementation of mψ: We used a fully connected neural network with two hidden layers with dimensions 200 and 100 (so RD R200 R100 R10) and the Re LU activations except for the final layer. Ablation study. We can use other KQuad functions than that of Hayakawa et al. (2022; 2023). Kernel thinning and kernel herding are also tested against the task Hopper-v4 in Appendix C as a small ablation study. 5 Concluding remarks We have developed a method called PGKQ, which combines existing policy gradient algorithm with kernel quadrature over episodes, aiming at better gradient estimates while keeping the number of actual reward computations small. PGKQ is based on GP models of the return Rt (Option 1) or the reward function r (Option 2), whose covariance kernel accumulated over episodes gives a kernel over the space of episodes, which allows the use of kernel quadrature for selecting a small but informative subset from a large batch of episodes. We also confirmed the competitiveness of PGKQ, especially the Option 2, by performing experiments in Mu Jo Co tasks. 2All the experiments with Mu Jo Co were conducted with a Google Cloud Vertex AI notebook with an NVIDIA T4 (16-core v CPU, 60 GB RAM). 3https://github.com/Deep X-inc/machina Published in Transactions on Machine Learning Research (02/2024) As a future direction, it should be beneficial to study a generalization of PGKQ based on vector-valued GPs as in existing BQPG approaches (Ghavamzadeh & Engel, 2007; Ghavamzadeh et al., 2016; Akella et al., 2021) to avoid the heuristic introduced in Section 3.2 for allowing the use of conventional KQ methods. While we have been focusing on online RL, extending our approach to offline RL represents another interesting direction. In such a setup, determining which samples from a large batch are essential for reward computation, given the current learning policy, becomes critically important. Masaki Adachi, Satoshi Hayakawa, Martin Jørgensen, Harald Oberhauser, and Michael A Osborne. Fast Bayesian inference with batch Bayesian quadrature via kernel recombination. In Advances in Neural Information Processing Systems, volume 35, pp. 16533 16547, 2022. Masaki Adachi, Satoshi Hayakawa, Saad Hamid, Martin Jørgensen, Harald Oberhauser, and Micheal A Osborne. SOBER: Highly parallel Bayesian optimization and bayesian quadrature over discrete and mixed spaces. ar Xiv preprint ar Xiv:2301.11832, 2023a. Masaki Adachi, Satoshi Hayakawa, Xingchen Wan, Martin Jørgensen, Harald Oberhauser, and Michael A Osborne. Domain-agnostic batch Bayesian optimization with diverse constraints via Bayesian quadrature. ar Xiv preprint ar Xiv:2306.05843, 2023b. Ravi Tej Akella, Kamyar Azizzadenesheli, Mohammad Ghavamzadeh, Animashree Anandkumar, and Yisong Yue. Deep Bayesian quadrature policy optimization. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp. 6600 6608, 2021. Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714 751, 2017. Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning, pp. 1355 1362, 2012. Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. ar Xiv preprint ar Xiv:1703.06476, 2017. Rémi Bardenet, Adrien Hardy, et al. Monte carlo with determinantal point processes. The Annals of Applied Probability, 30(1):368 417, 2020. Rémi Bardenet, Subhroshekhar Ghosh, and Meixia Lin. Determinantal point processes based on orthogonal polynomials for sampling minibatches in SGD. In Advances in Neural Information Processing Systems, volume 34, pp. 16226 16237, 2021. Ayoub Belhadji. An analysis of Ermakov Zolotukhin quadrature using kernels. In Advances in Neural Information Processing Systems, volume 34, 2021. Ayoub Belhadji, R Bardenet, and Pierre Chainais. Kernel quadrature with DPPs. In Advances in Neural Information Processing Systems, volume 32, pp. 12907 12917, 2019. Ayoub Belhadji, Rémi Bardenet, and Pierre Chainais. Kernel interpolation with continuous volume sampling. In International Conference on Machine Learning, pp. 725 735. PMLR, 2020. François-Xavier Briol, Chris J Oates, Jon Cockayne, Wilson Ye Chen, and Mark Girolami. On the sampling problem for kernel quadrature. In International Conference on Machine Learning, pp. 586 595. PMLR, 2017. Souradip Chakraborty, Amrit Singh Bedi, Pratap Tokekar, Alec Koppel, Brian Sadler, Furong Huang, and Dinesh Manocha. Posterior coreset construction with kernelized stein discrepancy for model-based reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 6980 6988, 2023. Published in Transactions on Machine Learning Research (02/2024) Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence, pp. 109 116, 2010. Francesco Cosentino, Harald Oberhauser, and Alessandro Abate. Carathéodory sampling for stochastic gradient descent. ar Xiv preprint ar Xiv:2006.01819, 2020. Josef Dick, Frances Y. Kuo, and Ian H. Sloan. High-dimensional integration: The quasi-Monte Carlo way. Acta Numerica, 22:133 288, 2013. Raaz Dwivedi and Lester Mackey. Kernel thinning. In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134, pp. 1753 1753. PMLR, 2021. Raaz Dwivedi and Lester Mackey. Generalized kernel thinning. In International Conference on Learning Representations, 2022. Ethan N Epperly and Elvira Moreno. Kernel quadrature with randomly pivoted cholesky. In Advances in Neural Information Processing Systems, volume 36, 2023. Dixia Fan, Liu Yang, Zhicheng Wang, Michael S. Triantafyllou, and George Em Karniadakis. Reinforcement learning for bluff body active flow control in experiments and simulations. Proceedings of the National Academy of Sciences of the United States of America, 117(42), 2020. Mohammad Ghavamzadeh and Yaakov Engel. Bayesian actor-critic algorithms. In Proceedings of the 24th International Conference on Machine Learning, pp. 297 304, 2007. Mohammad Ghavamzadeh, Yaakov Engel, and Michal Valko. Bayesian policy gradient and actor-critic algorithms. The Journal of Machine Learning Research, 17(1):2319 2371, 2016. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11): 139 144, 2020. Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman (eds.), Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006. Vijaykumar Gullapalli. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3(6):671 692, 1990. Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217 288, 2011. Satoshi Hayakawa, Harald Oberhauser, and Terry Lyons. Positively weighted kernel quadrature via subsampling. In Advances in Neural Information Processing Systems, volume 35, pp. 6886 6900, 2022. Satoshi Hayakawa, Harald Oberhauser, and Terry Lyons. Sampling-based Nyström approximation and kernel quadrature. In Proceedings of the 40th International Conference on Machine Learning, pp. 12678 12699, 2023. Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Conference on Uncertainty in Artificial Intelligence, pp. 377 386, 2012. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann Le Cun (eds.), Proceedings of the 3rd International Conference on Learning Representations, 2015. Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with ai feedback. ar Xiv preprint ar Xiv:2309.00267, 2023. Christian Litterer and Terry Lyons. High order recombination and an application to cubature on Wiener space. The Annals of Applied Probability, 22(4):1301 1327, 2012. Published in Transactions on Machine Learning Research (02/2024) Zimo Liu, Jingya Wang, Shaogang Gong, Huchuan Lu, and Dacheng Tao. Deep reinforcement active learning for human-in-the-loop person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. Nicholas Metropolis and Stanislaw Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335 341, 1949. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 1928 1937. PMLR, 2016. Anthony O Hagan. Bayes Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245 260, 1991. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, J. Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, P. Welinder, P. Christiano, J. Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pp. 27730 27744, 2022. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py Torch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019. Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682 697, 2008. Pankaj Rajak, Aravind Krishnamoorthy, Ankit Mishra, Rajiv Kalia, Aiichiro Nakano, and Priya Vashishta. Autonomous reinforcement learning agent for chemical vapor deposition synthesis of quantum materials. npj Computational Materials, 7(1), 2021. Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400 407, 1951. John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pp. 1889 1897, 2015. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. Samarth Sinha, Han Zhang, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, and Augustus Odena. Small GAN: Speeding up GAN training using core-sets. In Proceedings of the 37th International Conference on Machine Learning, pp. 9005 9015. PMLR, 2020. Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517 1561, 2010. Arthur H Stroud. Approximate calculation of multiple integrals. Prentice-Hall, 1971. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html. Published in Transactions on Machine Learning Research (02/2024) Maria Tchernychova. Carathéodory cubature measures. Ph D thesis, University of Oxford, 2016. Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026 5033. IEEE, 2012. Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium, March 2023. URL https://zenodo.org/record/8127025. Kazuma Tsuji, Ken ichiro Tanaka, and Sebastian Pokutta. Pairwise conditional gradients without swap steps and sparser kernel herding. In International Conference on Machine Learning, pp. 21864 21883. PMLR, 2022. Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. ar Xiv preprint ar Xiv:1811.10959, 2018. Xiaoqiang Wang, Yali Du, Shengyu Zhu, Liangjun Ke, Zhitang Chen, Jianye Hao, and Jun Wang. Orderingbased causal discovery with reinforcement learning. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 3566 3573. International Joint Conferences on Artificial Intelligence Organization, 2021. Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229 256, 1992. Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Arthur Gretton and Christian C. Robert (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pp. 370 378. PMLR, 09 11 May 2016. Yuan Xia, Jingbo Zhou, Zhenhui Shi, Chao Lu, and Haifeng Huang. Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis. In AAAI Conference on Artificial Intelligence, pp. 1062 1069, 2020. Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Determinantal point processes for mini-batch diversification. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017. Cheng Zhang, Cengiz Öztireli, Stephan Mandt, and Giampiero Salvi. Active mini-batch sampling using repulsive point processes. In Proceedings of the AAAI conference on Artificial Intelligence, volume 33, pp. 5741 5748, 2019. Barret Zoph and Quoc Quoc Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017. A.1 Proof of Proposition 1 Proof. Suppose f GP(0, K). We first compute the following term: EGP[µ(f)2] = EGP ZZ f(x)f(y) dµ(x) dµ(y) . By Cauchy-Schwarz, we have RR |f(x)f(y)| dµ(x) dµ(y) ( R f(x)2 dµ(x))1/2( R f(y)2 dµ(y))1/2 = R f(x)2 dµ(x), so we have ZZ |f(x)f(y)| dµ(x) dµ(y) EGP Z f(x)2 dµ(x) = Z EGP f(x)2 dµ(x) = Z K(x, x) dµ(x) < , Published in Transactions on Machine Learning Research (02/2024) where we have used Fubini s theorem (thanks to the nonnegativity of the integrand) in the first equality, and the assumption on the kernel in the last inequality. Thus, we can use Fubini s theorem for the original multiple integral, and we have, by using EGP[f(x)f(y)] = K(x, y), EGP µ(f)2 = ZZ EGP[f(x)f(y)] dµ(x) dµ(y) = ZZ K(x, y) dµ(x) dµ(y). From the same use of Fubini s theorem by (partially) replacing µ with µn, we can compute the other quadratic terms and obtain EGP (µn(f) µ(f))2 = ZZ K(x, y) dµ(x) dµ(y) 2 Z K(xi, y) dµ(y) + i,j=1 wiwj K(xi, xj), which is a well-known formula for the wce(µn; K, µ)2 (Gretton et al., 2006; Sriperumbudur et al., 2010). In general, we can also treat the non-centered case f GP(m, K) if m is integrable with respect to µ. Indeed, we can apply the above computation to f m GP(0, K) to obtain EGP (µn(f m) µ(f m))2 = wce(µn; K, µ)2. From the linearity of the integral, we actually have EGP (µn(f m) µ(f m))2 = EGP ((µn(f) µ(f)) (µn(m) µ(m)))2 = EGP (µn(f) µ(f))2 2(µn(m) µ(m)) EGP[µn(f) µ(f)] | {z } =µn(m) µ(m) +(µn(m) µ(m))2 = EGP (µn(f) µ(f))2 (µn(m) µ(m))2. Thus, in general, we have EGP (µn(f) µ(f))2 = wce(µn; K, µ)2 + (µn µ(m))2. A.2 Proof of Proposition 2 Proof. From the modeling (5) and (6), ˆAt formally follows a GP on the variables (e, t) E T such that EGP h ˆAt(e) i = 0, EGP h ˆAt(e) ˆAu(e ) i = kψ(zt, z u), where E is the space of episodes and T := {0, 1, 2, . . .} is the space of time indices. Let us define the space L2(GP) as the space of R-valued centered Gaussian variables given by taking the completion of the linear space span{ ˆAt(e) | (e, t) E T } with respect to the norm X L2(GP) := EGP X2 1/2. Then, since the kernel is bounded, we have P t 0 γt ˆAt(e) L2(GP) = P t 0 γtk(zt, zt)1/2 < , and so the infinite sum ˆF(e) = P t 0 γt ˆAt(e) is well-defined in L2(GP). By letting ˆFT (e) := PT t 0 γt ˆAt(e), we have ˆFT ˆF in L2(GP). Therefore, we have that { ˆF(e) | e E} is a family of centered (jointly) Gaussian variables with EGP h ˆF(e) ˆF(e ) i = lim T lim U EGP h ˆFT (e) ˆFU(e ) i = lim T lim U u=0 γt+u EGP h ˆAt(e) ˆAu(e ) i = lim T lim U u=0 γt+ukψ(zt, z u), (14) where e = (zt)t 0, e = (z u)u 0 are episodes. Since the kernel is bounded, the sum P t,u 0 γt+ukψ(zt, z u) is absolutely convergent, and coincides with the right-hand side of (14), which completes the proof. Published in Transactions on Machine Learning Research (02/2024) A.3 Proof of Proposition 3 Proof. The flow of the proof is mostly the same as the previous one. The base GP is now r GP(mψ, kψ). This time we define L2(GP) as the space of R-valued Gaussian variable given by the completion of the linear space span({1} {r(z) | z Z}) with respect to the norm X L2(GP) := EGP X2 1/2. Since mψ and kψ are bounded, we first have that Rt(e) = P u t γu tr(zu) is well-defined in L2(GP) as u t γu tr(zu) L2(GP) = X m(zu)2 + kψ(zu, zu) < . Note that L2(GP) is a Hilbert space containing 1 with the inner product X, X L2(GP) = EGP[XX ]. Thus, the expectation, which is the inner product with 1, is continuous with respect to the norm, and so we have EGP[Rt(e)] = lim T EGP u t γu tr(zu) u=t γu tmψ(zu) = X u t γu tmψ(zu), where the last infinite sum is absolutely convergent thanks to the boundedness of mψ. We can also prove that Rt(e) EGP[Rt(e)] = P u t γu t(r(zu) mψ(zu)) by combining two convergent sequences. By letting ˆAGP t (e) := Rt(e) EGP[Rt(e)], we have a family of centered jointly Gaussian variables { ˆAGP t (e) | (e, t) E T } such that ˆAGP t (e) L2(GP) X u t γu t(r(zu) mψ(zu)) L2(GP) X kψ(zu, zu) < C for a constant C > 0 independent of t and e, where E and T are the spaces of episodes and time indices as defined in the previous proof. Thus, ˆFGP(e) = P t 0 γt ˆAGP t (e) is well-defined in L2(GP) and { ˆFGP(e) | e E} is a family of centered jointly Gaussian variables. Since the double sum P u t γu t(r(zu) mψ(zu)) L2(GP) is absolutely convergent, we can exchange the sum as ˆFGP(e) = X u t γu t(r(zu) mψ(zu)) = X u=0 γu(r(zu) mψ(zu)) = X u 0 (1 + u)γu(r(zu) mψ(zu)). Therefore, the covariance kernel for ˆFGP(e) can formally be computed as EGP h ˆFGP(e) ˆFGP(e ) i = X t,u 0 (1 + t)(1 + u)γt+u EGP[(r(zt) mψ(zt))(r(z u) mψ(z u))] t,u 0 (1 + t)(1 + u)γt+ukψ(zt, z u) for episodes e = (zt)t 0, e = (z u)u 0, which is justified by the same logic as in the previous proof. B Experimental details B.1 Combining policy gradient with kernel quadrature Recall that we have introduced in (2) the single-episode gradient estimate t 0 γt ˆAt(e) θ log πθ(at|st) (15) with an advantage estimator ˆAt. Although we have written that we use 1 N PN i=1 ˆG(ei) as the Monte Carlo gradient estimate, what we write in the actual code is the computation of the (one-dimensional) loss i=1 Lvpg[ ˆAt](ei), where Lvpg[ ˆ At](e) := X t 0 γt ˆAt(e) log πθ(at|st), Published in Transactions on Machine Learning Research (02/2024) and running an automatic differentiation to get its gradient with respect to the parameter θ. Here, ˆAt is treated as just a functional of an episode in the definition of Lvpg[ ˆAt]. Let us explain how we actually compute the policy loss in the PGKQ given a kernel quadrature rule µn = (wi, ei)i I. We start from VPG (vanilla policy gradient). (v1) VPG with kernel quadrature with a centered GP (EGP[ ˆAt] = 0, Option 1) is the easiest case. Given µn, we just replace 1 N PN i=1 Lvpg[ ˆAt](ei) with P i I wi Lvpg[ ˆAt](ei) in loss computation. (v2) When we combine VPG and a non-centered GP modeling of r (Option 2), we just exploit the decomposition (10) for loss computation (with a baseline function b): i I wi Lv I(ei) + 1 i=1 Lv II(ei), ( Lv I(e) := P t 0 γt(Rt(e) EGP[Rt(e)]) log πθ(at|st), Lv II(e) := P t 0 γt(EGP[Rt(e)] b(st)) log πθ(at|st). As the baseline, we use the value estimator Vφ trained with fake rewards as explained in Remark 2. The observation Lv I = Lvpg[Rt EGP[Rt]] and Lv II = Lvpg[EGP[Rt] b] allows a unified implementation. We can also apply PGKQ to PPO (Schulman et al., 2017). In the usual PPO, we use the probability ratio (as a functional of an episode) qθ t (e) := πθ(at|st)/πθold(at|st), where θold is the policy parameter at which we assume the episodes e1, . . . , e N have been drawn. Given an advantage estimator ˆAt as a functional of an episode, we compute the loss as follows: i=1 Lppo[ ˆAt](ei), where Lppo[ ˆAt](e) := X t 0 γt min{qθ t (e) ˆAt(e), clip(qθ t (e), 1 ε, 1 + ε) ˆAt(e)}. Here, clip(a, b, c) := min{max{a, b}, c} and ε (= 0.2 in the implementation) is a clipping parameter. (p1) PPO with the centered GP modeling (Option 1) is given by a straightforward replacement of Lvpg by Lppo (p2) When combining PPO with the Option 2, we can also imitate the decomposition of (v2): i I wi Lp I(ei)+ 1 i=1 Lp II(ei), where Lp I := Lppo[Rt EGP[Rt]], Lp II := Lppo[EGP[Rt] b]. Here, b is again the value estimator Vφ trained by fake rewards. We can also consider the use of other advantage estimators than Rt b. Indeed, ˆA t(e) = Rt(e) Vφ(st) γt0 t(Rt0(e) Vφ(st0)) with t0 being the final time was adopted in the Mu Jo Co tasks (also following the original code of machina). The modification of (v1) and (p1) is straightforward as they formally do not depend on the specific form of ˆAt. For (v2) and (p2), though there are other possibilities, we just replaced Lv II and Lp II with L v II := Lvpg[EGP[ ˆA t]] and L p II := Lppo[EGP[ ˆA t]], where EGP[ ˆA t](e) := EGP[Rt(e)] Vφ(st) γt0 t(EGP[Rt0(e)] Vφ(st0)) is regarded as a functional of an episode. B.2 Reward vs step Though we only compared the rewards against iterations in the main body, since it is common to compare the rewards against observed steps (e.g., Schulman et al., 2017), we here present our experimental results in that regard. Figure 2 is on each of the four Mu Jo Co tasks corresponding to Figure 1. Note that this is not necessarily a fair comparison to {vpg,ppo}-large in terms of the learning rate per step. Published in Transactions on Machine Learning Research (02/2024) reward (vpg) reward (ppo) Inverted Double Pendulum-v4 Hopper-v4 Walker2d-v4 Half Cheetah-v4 Figure 2: Learning curves in Mujoco Tasks: reward vs step C Ablation study To see how the performance is dependent on the choice of concrete kernel quadrature algorithms, we additionally ran the PGKQ algorithms with the kernel thinning and kernel herding (explained below), instead of the one from Hayakawa et al. (2022; 2023). The experiments were all run on the Hopper-v4 task, and all the experimental details other than algorithms are the same as in Section 4, except that we ran each algorithm five times instead of ten. We compared the following algorithms: {vpg, ppo}-KQ-{thin, herd}: PGKQ given by combining the corresponding base policy gradient (PG) and kernel quadrature (KQ) algorithms, with a GP modeling of r (Option 2, Algorithm 3). {vpg, ppo}-KQ-{thin, herd}-no-mean: PGKQ given by combining the corresponding base PG and KQ algorithms, with a GP modeling of Rt (Option 1, Algorithm 2). We can implement these algorithms by simply using thinning or herding as the KQuad function in Algorithm 2 or Algorithm 3. We shall briefly explain these algorithms below. In our setting, both algorithms just require a set EN of N points (episodes) that makes an empirical measure and a positive definite kernel K over the set, and then returns an equally weighted set of n points (N = 64 and n = 8 in our experiments). Kernel thinning. The option thin corresponds to kernel thinning (Dwivedi & Mackey, 2021; 2022) with algorithmic acceleration based on Shetty et al. (2022). This is a stochastic algorithm that runs in O N log3 N computational steps for approximating an N-point empirical measure (given by EN) by an npoint equally weighted measure (with n N). We do not describe the whole algorithm here; it is given by recursive applications of a clever randomized algorithm that divides the given set into two balanced subsets, which they call kernel halving. Kernel herding. The option herd corresponds to kernel herding (Chen et al., 2010; Bach et al., 2012). This is a greedy sequential algorithm such that, at the t-th iteration, we select the point xi E such that wce( t 1 t δxi; K, µ) is minimized over the choice of xi, where E is the set of candidate points (episodes), δ0 is the zero measure, K is a given positive definite kernel over E, and µ is the target measure, which we want to approximate. Since we have no access to the true mean embedding in our setting, we let E := EN and µ := 1 N PN i=1 δei be the uniform measure over EN = (ei)N i=1. The minimization in each step is done by an exhaustive search over EN, and the algorithm runs in O N 2 computational steps in our implementation. Published in Transactions on Machine Learning Research (02/2024) reward (vpg) reward (ppo) Figure 3: Learning curves of PGKQ variants in Hopper-v4: reward vs iteration and step The results are given in Figure 3. By comparing it with Figure 1 and Figure 2, we can see that the thinningbased methods perform well; they give similar results to the PGKQ algorithms presented in the main body of the paper. On the contrary, the herding-based methods perform worse than those based on the other two KQ algorithms, while {vpg, ppo}-KQ-herd certainly improve upon the small-batch {vpg, ppo}. Thus, we propose to use the original PGKQ in the main body or the thinning-based variant as the first choice. However, note that this is a small ablation study based on a particular task and it requires further investigation to determine which KQ algorithm should be adopted in PGKQ.