# Policy Search in Reproducing Kernel Hilbert Space

Ngo Anh Vien, Peter Englert and Marc Toussaint
Machine Learning and Robotics Lab, University of Stuttgart, Germany
{vien.ngo, peter.englert, marc.toussaint}@ipvs.uni-stuttgart.de

## Abstract

Modeling policies in reproducing kernel Hilbert space (RKHS) renders policy gradient reinforcement learning algorithms non-parametric. As a result, the policies become very flexible and have a rich representational potential without a predefined set of features. However, their updates might be either non-covariant under reparameterization of the chosen kernel or very sensitive to step-size selection. In this paper, we propose to use a general framework to derive a new RKHS policy search technique. The new derivation leads to both a natural RKHS actor-critic algorithm and an RKHS expectation-maximization (EM) policy search algorithm. Further, we show that kernelization enables us to learn in partially observable (POMDP) tasks, which is considered daunting for parametric approaches. Via sparsification, a small set of support vectors representing the history is shown to be discovered effectively. For evaluation, we use three simulated (PO)MDP reinforcement learning tasks and a simulated PR2 robotic manipulation task. The results demonstrate the effectiveness of the new RKHS policy search framework in comparison to plain RKHS actor-critic, episodic natural actor-critic, plain actor-critic, and PoWER approaches.

## 1 Introduction

Policy search methods in reinforcement learning (RL) have recently been shown to be very effective in very high-dimensional robotics problems [Williams, 1992; Kober et al., 2013]. In such applications, the policy is compactly represented in a parametric space and effectively optimized using gradient descent [Sutton et al., 1999]. Theoretically, policy gradient methods have strong convergence guarantees. In larger problems, their performance and computation can be further enhanced when combined with function approximation, which may be linear [Kober et al., 2013] or non-linear [Silver et al., 2014; Lillicrap et al., 2015]. Nonetheless, parametric approaches require a careful design of the features, which is inflexible in practical use and inefficient in large problems. One can overcome this inflexibility and limited adaptability by using nonparametric techniques, in particular by modeling the policy in an RKHS (reproducing kernel Hilbert space) of functions [Bagnell and Schneider, 2003a; Lever and Stafford, 2015; van Hoof et al., 2015]. Moreover, an adaptively compact policy representation can be achieved via sparsification in the RKHS. For multi-dimensional actions, a vector-valued kernel can be used [Micchelli and Pontil, 2005].

However, standard policy gradient methods might face difficult convergence problems. The issue stems either from using a non-covariant gradient or from selecting an ad-hoc step-size. The latter is a standard problem in any gradient-based method and is classically addressed using line-search, which may be sample-inefficient. The step-size selection problem can be avoided by using EM-inspired techniques instead [Vlassis et al., 2009; Kober and Peters, 2011]. Using a non-covariant gradient means that the gradient direction might change significantly even under a simple re-parameterization of the policy representation.
Fortunately, the inherently non-covariant behaviour due to coordinate transformations [Amari, 1998] can be fixed by using the natural gradient with respect to one particular Riemannian metric, namely the natural metric or Fisher information metric [Kakade, 2001; Bagnell and Schneider, 2003b; Peters and Schaal, 2008]. The EM-inspired and policy gradient algorithms have further been shown to be special cases of a common policy search framework based on free-energy minimization [Kober and Peters, 2011; Neumann, 2011].

Inspired by this general derivation, in this paper we propose a policy search framework in RKHS which results in two RKHS policy search algorithms: RKHS natural actor-critic and RKHS EM-inspired policy search. We derive the following results:

- The natural gradient in RKHS is computed with respect to the Fisher information metric, which is a by-product of considering the path-distribution manifold in the policy update. We show that plain functional policy gradients are non-covariant under changes of the kernel's hyperparameters, e.g. the variance of RBF kernels or the free parameters of polynomial kernels. We prove that our functional natural policy gradient is a covariant update rule. As a result, we design a new natural actor-critic in the RKHS framework, called RKHS-NAC. Using a simple example, we show that RKHS-AC requires a very careful choice of the kernel's hyperparameters, as this drastically affects the performance and the quality of the converged policy. By contrast, RKHS-NAC frees us from that painful problem.
- The RKHS EM-inspired algorithm can be considered a kernelized PoWER (Policy Learning by Weighting Exploration with the Returns). Both the mean and variance functionals of the policy are updated analytically based on a notion of bounds on policy improvements.
- The general RKHS policy search framework turns out to be an efficient technique for learning in POMDPs, where the policy is modeled on a sparse set of observed histories. This nonparametric representation renders our RKHS policy search algorithms a promising approach for POMDP learning, which is daunting for parametric ones.

We compare our new framework to the recently introduced RKHS actor-critic framework (RKHS-AC) [Lever and Stafford, 2015] and to parametric approaches such as the episodic natural actor-critic [Peters and Schaal, 2008], the actor-critic framework [Konda and Tsitsiklis, 1999], and PoWER [Kober and Peters, 2011]. The comparisons are done on benchmark RL problems in both MDP and POMDP environments, and on a robotic manipulation task implemented on a simulated PR2 platform.

## 2 Background

In this section, we briefly present the background that will be used in the rest of the paper.

### 2.1 Markov Decision Process

A Markov decision process (MDP) is defined as a 5-tuple $\{S, A, T, R, \gamma\}$, where $S$ is the state space, $A$ the action space, $T$ the transition function, $R$ the reward function, and $\gamma \in (0, 1)$ a discount factor.

In the parametric setting, the controller is a Gaussian policy whose mean is a linear function of features, i.e. $a = \theta^{\top}\phi(s) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(\varepsilon; 0, \Sigma)$. For brevity, we assume that the variance $\Sigma$ is a diagonal matrix with entries $\sigma_i^2$. In the EM-inspired (PoWER-style) approach, the policy is improved by computing the next parameter value $\theta'$ that makes $\nabla_{\theta'} L(\theta'; \theta)$ equal to zero. As a result, the update takes the form of exploration weighted by the returns, i.e. for the $i$-th dimension

$$\theta'_i = \theta_i + \frac{E\big[\sum_t \varepsilon_{t,i}\, A^{\pi}(s_t, a_t)\big]}{E\big[\sum_t A^{\pi}(s_t, a_t)\big]}.$$

Taking the derivative of the lower bound w.r.t. $\sigma_i$ yields a similar update expression for the variance $\Sigma'$.
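To make this reward-weighted EM update concrete, here is a minimal Python sketch of a closed-form, reward-weighted-regression variant of the update for the Gaussian policy $a = \theta^{\top}\phi(s) + \varepsilon$ (in the spirit of [Peters and Schaal, 2007]). It is an illustration under our own simplifications (ridge regularizer `lam`, non-negative advantage weights, batched arrays), not the authors' implementation.

```python
import numpy as np

def reward_weighted_update(phi_all, a_all, adv_all, lam=1e-6):
    """Reward-weighted regression form of the EM policy update for a Gaussian
    policy a = theta^T phi(s) + eps: maximize the advantage-weighted
    log-likelihood in closed form.

    phi_all: (T, d) stacked features phi(s_t)
    a_all:   (T, n) stacked executed actions a_t
    adv_all: (T,)   advantage (or return) estimates A(s_t, a_t), assumed >= 0
    returns: theta' of shape (d, n) and per-dimension noise variances sigma'^2
    """
    W = np.diag(adv_all)
    d = phi_all.shape[1]
    # theta' = (Phi^T W Phi + lam I)^-1 Phi^T W A  (weighted ridge regression)
    theta_new = np.linalg.solve(phi_all.T @ W @ phi_all + lam * np.eye(d),
                                phi_all.T @ W @ a_all)
    # variance update: advantage-weighted mean of squared residuals per dimension
    resid = a_all - phi_all @ theta_new
    sigma2_new = (adv_all[:, None] * resid ** 2).sum(axis=0) / adv_all.sum()
    return theta_new, sigma2_new
```

The variance estimate mirrors the per-dimension update mentioned above: each $\sigma'^2_i$ is the advantage-weighted mean of the squared residuals in that action dimension.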
### 2.3 Actor-Critic in RKHS

As recently introduced by [Bagnell and Schneider, 2003a; Lever and Stafford, 2015], the policies are parameterized by functionals in reproducing kernel Hilbert spaces. More specifically, the functional $h(s)$ in Eq. 2 is an element of a vector-valued RKHS $\mathcal{H}_K$,

$$h(\cdot) = \sum_i K(s_i, \cdot)\,\alpha_i, \qquad \alpha_i \in A \ (\subset \mathbb{R}^n), \qquad (7)$$

where the kernel $K$ is defined as a mapping $K: S \times S \to \mathcal{L}(A)$ [Micchelli and Pontil, 2005], and $\mathcal{L}(A)$ is the space of linear operators on $A$. The simplest choice of $K$ is $K(s, s') = \kappa(s, s')\, I_n$, where $I_n$ is an $n \times n$ identity matrix and $\kappa$ is a scalar-valued kernel [Schölkopf and Smola, 2002]. The functional gradient, i.e. the Fréchet derivative, of the term $\log \pi_h(a_t|s_t)$ is

$$\nabla_h \log \pi_h(a_t|s_t) = K(s_t, \cdot)\,\Sigma^{-1}\big(a_t - h(s_t)\big) \in \mathcal{H}_K. \qquad (8)$$

Therefore, the gradient of the performance measure w.r.t. $h$, obtained via the likelihood-ratio method, is approximated from a set of $N$ sampled trajectories $\{\tau_i\}$ of $H$ steps as

$$\nabla_h U(\pi_h) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_h \log P(\tau_i; h)\, R(\tau_i) = \frac{1}{N}\sum_{i=1}^{N}\sum_t A^{\pi}(s^i_t, a^i_t)\, K(s^i_t, \cdot)\,\Sigma^{-1}\big(a^i_t - h(s^i_t)\big). \qquad (9)$$

As a result, $\nabla_h U(\pi_h) \in \mathcal{H}_K$ and the functional gradient update at iteration $k$ is

$$h_{k+1} = h_k + \alpha\,\nabla_h U(\pi_{h_k}), \qquad (10)$$

where $\alpha$ is a step-size. After each gradient update, the number of centres $s_t$ of the next $h_{k+1}$ might increase rapidly; a sparse representation can be attained via sparsification. In addition, Lever and Stafford showed that a compatible kernel $K_h$ for approximating $A^{\pi}(s, a)$ can be constructed from $K$ as

$$K_h\big((s, a), (s', a')\big) = \big(\Sigma^{-1}(a - h(s))\big)^{\top} K(s, s')\,\Sigma^{-1}\big(a' - h(s')\big). \qquad (11)$$

Figure 1: Discounted cumulative reward in the Toy domain with polynomial and RBF kernels. The horizon H is set to 20. We report the mean performance averaged over 50 runs and its 95% confidence interval as shaded areas.

As a result of compatible function approximation, there exists a feature space $\mathcal{F}_\phi \subset \mathcal{H}_K$ that contains $A^h(s, a)$ and the weight $w$, satisfying

$$A^h(s, a) = \big\langle w, \nabla_h \log \pi(a|s)\big\rangle_{\mathcal{F}_\phi} = \big\langle w, K(s, \cdot)\,\Sigma^{-1}(a - h(s))\big\rangle_{\mathcal{H}_K}.$$

There are a number of ways to regress $A^h(s, a)$ from data generated by $\pi_h$. Lever and Stafford [2015] used kernel matching pursuit [Vincent and Bengio, 2002] for both function regression and sparsification. Accordingly, they proposed the compatible RKHS actor-critic framework, RKHS-AC.

However, the RKHS-AC gradient update is not covariant w.r.t. changes of the kernel's hyperparameters. We show an example use of RKHS-AC on the toy problem in Fig. 1 (see the experiment section for the problem description). We use two types of kernel to represent the functional policy in Eq. 2 with different hyperparameter values: a polynomial kernel $\kappa(s, s') = (s^{\top}s' + c)^d$ with $d = 5$ and $c \in \{0.01, 0.2, 1.0\}$, and an RBF kernel $\kappa(s, s') = \exp(-\|s - s'\|^2/(2\,\mathrm{var}))$ with $\mathrm{var} \in \{0.01, 0.1, 2.0\}$. Note that changing the hyperparameters $c$ and $\mathrm{var}$ amounts to a transformation of the coordinate system of the same corresponding feature space. The results show that RKHS-AC is non-covariant: it does not guarantee good improvement in all settings. For instance, in the cases $c = 0.01$ and $\mathrm{var} = 2.0$ some features are very peaky, which makes the policy overwhelmingly exploit non-optimal actions at an early stage of learning.
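Before moving to the natural gradient, the following is a minimal Python sketch of an RKHS Gaussian policy with the plain functional gradient update of Eqs. 7-10, assuming the scalar-kernel case $K(s, s') = \kappa(s, s')\,I_n$ and a fixed isotropic covariance $\sigma^2 I$. The class layout, the `(s, a, advantage)` trajectory format, and the omission of sparsification are our own simplifications, not the code of [Lever and Stafford, 2015].

```python
import numpy as np

class RKHSPolicy:
    """Gaussian policy a ~ N(h(s), sigma^2 I) with h(.) = sum_i kappa(c_i, .) alpha_i
    (Eq. 7), using the simplest kernel choice K(s, s') = kappa(s, s') I_n."""

    def __init__(self, action_dim, sigma=0.5, bandwidth=1.0):
        self.centres = []            # states c_i acting as kernel centres
        self.alphas = []             # coefficient vectors alpha_i in R^n
        self.sigma = sigma           # exploration standard deviation
        self.bw = bandwidth          # RBF bandwidth (e.g. set by the median trick)
        self.n = action_dim

    def kernel(self, s1, s2):
        d = np.atleast_1d(np.asarray(s1, dtype=float) - np.asarray(s2, dtype=float))
        return np.exp(-np.dot(d, d) / (2.0 * self.bw))

    def h(self, s):
        """Evaluate the mean functional h(s) = sum_i kappa(c_i, s) alpha_i."""
        out = np.zeros(self.n)
        for c, a in zip(self.centres, self.alphas):
            out += self.kernel(c, s) * a
        return out

    def sample(self, s):
        return self.h(s) + self.sigma * np.random.randn(self.n)

    def plain_gradient_step(self, trajectories, step):
        """Plain functional gradient update (Eqs. 8-10): every visited state becomes
        a new centre with coefficient step * A(s,a) * (a - h(s)) / sigma^2,
        computed with the old functional h."""
        new = []
        for traj in trajectories:                    # traj: list of (s, a, advantage)
            for (s, a, adv) in traj:
                coef = step * adv * (np.asarray(a) - self.h(s)) / self.sigma ** 2
                new.append((s, coef))
        for s, coef in new:
            self.centres.append(s)
            self.alphas.append(coef)
        # in practice h is then sparsified, e.g. by kernel matching pursuit
```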
## 3 Policy Search in RKHS

Inspired by the derivation of the parametric policy search framework, we derive a new policy search framework which enables policy modeling in RKHS. Assume that the controller is a Gaussian policy $\pi_h(a|s) = \mathcal{N}(a; h(s), \Sigma(s))$, where $a \in \mathbb{R}^n$ and $h \in \mathcal{H}_K$.

### 3.1 Natural Policy Gradient in RKHS

Let $G$ denote the Fisher information operator on $\mathcal{H}_K$ (Eq. 16), obtained by considering the path-distribution manifold. The operator $G^*G$ is bounded, i.e. there exists $M > 0$ such that $\|G^*G h\| \le M \|h\|$. From those results on $G$ and $G^*G$, we can derive

$$\langle Gh, Gh\rangle = \langle h, G^*G h\rangle \le \|h\|\,\|G^*G h\| \le M\,\|h\|^2,$$

which means the linear operator $G$ is also bounded. From a set of $N$ trajectories $\{\tau_i\}$, $G$ can be estimated as

$$G \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_h \log P(\tau_i; h)\,\nabla_h \log P(\tau_i; h)^{\top} = \frac{1}{N}\,\Phi(\tau)\,\Phi(\tau)^{\top},$$

where $\Phi(\tau) = [\nabla_h \log P(\tau_1; h), \ldots, \nabla_h \log P(\tau_N; h)]$. Using Lagrange multipliers for the optimization problem in (15), the natural functional gradient is

$$\nabla^{\mathrm{NG}}_h U(\pi_h) = G^{-1}\,\nabla_h U(\pi_h) \approx \Big(\tfrac{1}{N}\Phi(\tau)\Phi(\tau)^{\top} + \lambda I\Big)^{-1}\tfrac{1}{N}\,\Phi(\tau)\,R, \qquad (18)$$

where $R = [R(\tau_1), \ldots, R(\tau_N)]^{\top}$ and $\lambda$ is a regularizer. Using the Woodbury identity for Eq. 18, we obtain

$$\nabla^{\mathrm{NG}}_h U(\pi_h) \approx \Phi(\tau)\big(\Phi(\tau)^{\top}\Phi(\tau) + \lambda N I\big)^{-1}\hat{R} = \sum_{i=1}^{N}\Big[\sum_t K(s^i_t, \cdot)\,\Sigma^{-1}\big(a^i_t - h(s^i_t)\big)\Big]\big[(K_\tau + \lambda N I)^{-1}\hat{R}\big]_i,$$

where $K_\tau = \Phi(\tau)^{\top}\Phi(\tau)$ is the Gram matrix of trajectories, built from the trajectory kernel function $K_\tau(\tau_i, \tau_j)$, and $\hat{R} = R$. We now define the scalar-valued trajectory kernel $K_\tau(\tau_i, \tau_j)$. Note that the gradient of the log trajectory distribution is

$$\nabla_h \log P(\tau_i; h) = \sum_t K(s^i_t, \cdot)\,\Sigma^{-1}\big(a^i_t - h(s^i_t)\big), \qquad (19)$$

hence

$$K_\tau(\tau_i, \tau_j) = \big\langle \nabla_h \log P(\tau_i; h),\, \nabla_h \log P(\tau_j; h)\big\rangle = \sum_{t, t'} K_h\big((s^i_t, a^i_t), (s^j_{t'}, a^j_{t'})\big),$$

with $K_h$ defined in Eq. 11. The resulting trajectory kernel is thus a sum of compatible kernel evaluations over all visited state-action pairs. As a sum of proper kernels, the trajectory kernel is itself a proper kernel and specifies an RKHS $\mathcal{H}_{K_\tau}$ of trajectory feature maps. Finally, the plain natural policy gradient in RKHS, reported in Algorithm 1, has the following update:

$$h_{t+1} = h_t + \alpha\,\nabla^{\mathrm{NG}}_h U(\pi_h). \qquad (20)$$

### 3.2 Episodic Natural Actor-Critic in RKHS

We now use the above results to propose a new episodic natural actor-critic framework that enables policy modeling in RKHS.

Algorithm 1: RKHS Policy Search Framework
  Given the kernel K, initialize h_0 = 0
  for k = 0, 1, 2, ... do
    Sample N trajectories {tau_i} from pi_{h_k}
    Estimate A^pi(s, a) using {tau_i}_{i=1}^N to obtain w in H_K
    RKHS-NPG:   h_{k+1} = h_k + alpha * grad^NG_h U(pi_{h_k})
    RKHS-NAC:   h_{k+1} = h_k + alpha_k * w
    RKHS-PoWER: update h_{k+1}, g_{k+1} as in Eq. 24
    Sparsify h_{k+1} in H_K
  end for

The Actor Update
First, we rewrite the Fisher information operator in Eq. 16 by substituting the definition of $\nabla_h \log P(\tau)$ in Eq. 5 to obtain

$$G = E_{P(\tau; h)}\Big[\sum_t \nabla_h \log \pi(a_t|s_t)\,\nabla_h \log \pi(a_t|s_t)^{\top}\Big], \qquad (21)$$

where we use the fact that $\int \pi(a|s)\, da = 1$ twice. On the other hand, substituting $A(s, a)$ from Eq. 12 into Eq. 9 yields

$$\nabla_h U(\pi_h) = E_{P(\tau; h)}\Big[\sum_t \big\langle w, \nabla_h \log \pi(a_t|s_t)\big\rangle\,\nabla_h \log \pi(a_t|s_t)\Big] = E_{P(\tau; h)}\Big[\sum_t \nabla_h \log \pi(a_t|s_t)\,\nabla_h \log \pi(a_t|s_t)^{\top}\Big]\, w = G\,w.$$

As a result, the natural gradient in RKHS used in the actor update is

$$\nabla^{\mathrm{NG}}_h U(\pi_h) = G^{-1} G\, w = w \in \mathcal{H}_K. \qquad (22)$$

The Critic Evaluation
Similar to eNAC [Peters and Schaal, 2008], we write the entire rollout of Bellman equations for an episode and insert the definition of $A^{\pi}(s_t, a_t)$ from Eq. 12 to get

$$\sum_{t=0}^{H} \gamma^t r(s_t, a_t) = \sum_{t=0}^{H} \gamma^t A^{\pi}(s_t, a_t) + V^{\pi}(s_0) = \sum_{t=0}^{H} \gamma^t \big\langle w,\, K(s_t, \cdot)\,\Sigma^{-1}(a_t - h(s_t))\big\rangle + V^{\pi}(s_0),$$

where $V^{\pi}(s_0)$ is the value function of the starting state. One could use another kernel to estimate $V^{\pi}(s_0)$. We propose to use the kernel matching pursuit algorithm [Vincent and Bengio, 2002] to regress $A^{\pi}(s, a)$ from a set of trajectories sampled from the policy $\pi_h$. Putting all ingredients together, we derive the RKHS natural actor-critic framework (RKHS-NAC), as described in Algorithm 1. The algorithm uses the kernel matching pursuit method to sparsify the new functional $h_{k+1}$.
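The RKHS-NAC actor update simply replaces the plain gradient by the critic weight $w$, but the RKHS-NPG variant needs the trajectory-kernel computation of Eqs. 18-20 explicitly. Below is a sketch of that computation in Python, reusing the RKHSPolicy class from the previous listing; using the summed per-step advantages as the trajectory scores $R(\tau_i)$ and a plain ridge regularizer are our assumptions.

```python
import numpy as np

def natural_gradient_step(policy, trajectories, step, lam=1e-3):
    """RKHS natural policy gradient update (Eqs. 18-20), scalar-kernel case.
    The natural gradient Phi (Phi^T Phi + lam*N*I)^-1 R is represented by one
    coefficient beta_i per trajectory; Phi^T Phi is the trajectory Gram matrix
    built from the compatible kernel K_h of Eq. 11."""
    N = len(trajectories)
    sig4 = policy.sigma ** 4

    def compat_kernel(s, a, s2, a2):
        # K_h((s,a),(s',a')) with K = kappa*I and Sigma = sigma^2*I
        return policy.kernel(s, s2) * np.dot(np.asarray(a) - policy.h(s),
                                             np.asarray(a2) - policy.h(s2)) / sig4

    gram = np.zeros((N, N))               # trajectory Gram matrix K_tau
    for i, ti in enumerate(trajectories):
        for j, tj in enumerate(trajectories):
            gram[i, j] = sum(compat_kernel(s, a, s2, a2)
                             for (s, a, _) in ti for (s2, a2, _) in tj)

    # trajectory scores: summed advantages stand in for R(tau_i) here
    R = np.array([sum(adv for (_, _, adv) in t) for t in trajectories])
    beta = np.linalg.solve(gram + lam * N * np.eye(N), R)

    # h <- h + step * sum_i beta_i * grad_h log P(tau_i; h)   (Eq. 19)
    new = []
    for b, traj in zip(beta, trajectories):
        for (s, a, _) in traj:
            new.append((s, step * b * (np.asarray(a) - policy.h(s)) / policy.sigma ** 2))
    for s, coef in new:
        policy.centres.append(s)
        policy.alphas.append(coef)
```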
### 3.3 EM-Inspired Policy Search in RKHS

The major problem of policy gradient approaches like RKHS-AC and RKHS-NAC is the step-size $\alpha_k$. Inspired by PoWER, we propose an alternative solution that maximizes the lower bound $L(h'; h)$ by setting its derivative $\nabla_{h'} L(h'; h)$ to zero and solving for $h'$. Setting

$$\nabla_{h'} L(h'; h) = E_{P(\tau; h)}\Big[\sum_t A^{\pi}(s_t, a_t)\, K(s_t, \cdot)\,\Sigma^{-1}(s_t)\,\big(a_t - h'(s_t)\big)\Big] = 0$$

and solving for $h'$ yields

$$h' = \Big(E_{P(\tau; h)}\Big[\sum_t A^{\pi}(s_t, a_t)\, K(s_t, \cdot)\,\Sigma^{-1}(s_t)\, K^{\top}(s_t, \cdot)\Big]\Big)^{-1} E_{P(\tau; h)}\Big[\sum_t A^{\pi}(s_t, a_t)\, K(s_t, \cdot)\,\Sigma^{-1}(s_t)\, a_t\Big],$$

where we exploited the reproducing property $h'(s) = \langle h', K(s, \cdot)\rangle$. The update rule for the variance functionals $\Sigma(\cdot)$, written per dimension as $g_i(\cdot)$, is derived similarly:

$$g'_i(\cdot) = \frac{E_{P(\tau; h)}\big[\sum_t A^{\pi}(s_t, a_t)\, K_i(s_t, \cdot)\, K_i(s_t, s_t)^{-1}\, \varepsilon_{t,i}^2\big]}{E_{P(\tau; h)}\big[\sum_t A^{\pi}(s_t, a_t)\big]} \quad \text{(for the $i$-th dimension)},$$

where $\varepsilon_{t,i} = a_{t,i} - h_i(s_t)$.

For an example of how to estimate $h'$ and $\Sigma'$ given a set of $N$ trajectories $\{(s_{j,t}, a_{j,t})\}_{j=1..N,\, t=0..H}$ sampled from $\pi_h$, we assume $K(s, s') = \kappa(s, s')\, I_n$ for simplicity, where $\kappa$ is a scalar-valued kernel. The estimates for the $i$-th dimension are¹

$$h'_i(\cdot) = \sum_{j,t} \kappa(s_{j,t}, \cdot)\,\beta_{i,(j,t)} = \Phi\,\beta_i, \qquad \beta_i = \big(\lambda A_1^{-1} + \Gamma\big)^{-1} A_2,$$

where we denote $\Phi = \big[\kappa(s_{1,0}, \cdot), \ldots, \kappa(s_{N,H}, \cdot)\big]$, $A_1 = \mathrm{diag}\big(A(s_{1,0}, a_{1,0})/g_i(s_{1,0}), \ldots, A(s_{N,H}, a_{N,H})/g_i(s_{N,H})\big)$, $A_2 = (a_{1,0,i}, \ldots, a_{N,H,i})^{\top}$, and $\Gamma = \Phi^{\top}\Phi$ is a Gram matrix with $\Gamma_{jk} = \kappa(s_j, s_k)$ (the constant $\lambda$ is a regularizer)². The variance estimate is

$$g'_i = \frac{\sum_{j,t} A(s_{j,t}, a_{j,t})\,\varepsilon_{j,t,i}^2\,\kappa(s_{j,t}, s_{j,t})^{-1}}{\sum_{k,l} A(s_{k,l}, a_{k,l})}.$$

¹ The results are based on the use of the Woodbury matrix identity.
² $\beta_i$ is an $(N \cdot H) \times 1$ vector, reshaped to an $N \times H$ matrix.

The overall algorithm, named RKHS-PoWER, is reported in Algorithm 1.

### 3.4 Policy Search in POMDP Environment

One of the very interesting characteristics of the RKHS policy search framework is learning in POMDP environments. POMDP learning requires the action selection at time $t$, $\mu(a_t|h_t)$, to depend on the history $h_t = \{o_0, a_0, \ldots, o_t\}$, where $o_t$ is an observation. Feature design in POMDP learning is elusive for standard parametric policy search approaches, as the history space is very large and high-dimensional. It is common for parametric methods to put regular features on each state dimension. Hence parametric methods suffer not only from the curse of dimensionality (the number of features is exponential in the number of state dimensions) but also from the curse of history (the number of features is exponential in the length of the history). In contrast, our RKHS policy search approaches discover history features adaptively and sparsely. One may find this learning similar to non-controlled POMDP learning using conditional embeddings of predictive state representations [Boots et al., 2013]. In our experiments, we fix the length of $h_t$ (the number of history steps) and use an RBF kernel to measure the distance between two histories.
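As an illustration of this history-based representation, the sketch below builds fixed-length history vectors and a scalar RBF kernel over them, with the bandwidth set by the median trick used in the experiments. The zero-padding at the start of an episode, the flattening of observation-action pairs into one vector, and the function names are our assumptions.

```python
import numpy as np

def history_vector(obs, acts, length, obs_dim, act_dim):
    """Fixed-length history feature from the last `length` observation-action pairs,
    zero-padded at the start of an episode (padding convention is an assumption)."""
    parts = []
    for k in range(length, 0, -1):
        i = len(obs) - k
        o = np.asarray(obs[i], dtype=float).ravel() if i >= 0 else np.zeros(obs_dim)
        a = (np.asarray(acts[i], dtype=float).ravel()
             if 0 <= i < len(acts) else np.zeros(act_dim))
        parts.append(np.concatenate([o, a]))
    return np.concatenate(parts)

def median_bandwidth(histories):
    """Median trick: use the median pairwise squared distance as the RBF bandwidth."""
    d2 = [np.sum((x - y) ** 2)
          for i, x in enumerate(histories) for y in histories[i + 1:]]
    return float(np.median(d2)) if d2 else 1.0

def history_kernel(h1, h2, bw):
    """Scalar RBF kernel over fixed-length histories."""
    return np.exp(-np.sum((h1 - h2) ** 2) / (2.0 * bw))
```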
## 4 Experiments

We evaluate and compare on four domains (three with learning in MDPs and one with learning in a POMDP): a Toy domain, the benchmark Inverted Pendulum domain, a simulated PR2 robotic manipulation task, and the POMDP Inverted Pendulum. The compared methods are RKHS-NAC, RKHS-PoWER, RKHS-AC [Lever and Stafford, 2015], PoWER [Kober and Peters, 2011], eNAC [Peters and Schaal, 2008], and actor-critic (AC). All methods use a discount factor γ = 0.99 for learning, and we report the discounted cumulative reward. The number of centres used in the RKHS methods is comparable to the number of basis functions used in the parametric methods. All tasks use the RBF kernel, whose bandwidth is chosen using the median trick.

### 4.1 Toy Domain

The Toy domain was introduced by Lever and Stafford [2015]. It has a state space S = [-4.0, 4.0], an action space A = [-1.0, 1.0], the starting state s_0 = 0, and a reward function r(s, a) = exp(-|s - 3|). The dynamics is s_{t+1} = s_t + a_t + ε, where ε is small Gaussian noise. We set N = 10 and H = 20. All controllers use 20 centres. The optimal value is around 14.5 [Lever and Stafford, 2015].

We first study the covariant behaviour of the natural RKHS policy gradient method, which is not observed for the standard RKHS-AC described in Section 2.3. Both algorithms use line-search over a grid of 50 choices. The results, computed over 50 runs, are depicted in Fig. 2. The consistent performance of RKHS-NAC across all settings, for both polynomial and RBF kernels, confirms that RKHS-NAC is covariant.

Figure 2: Toy domain with (left) the polynomial kernel and (right) the RBF kernel, comparing RKHS-AC and RKHS-NAC for c in {0.01, 0.2, 1.0} and var in {0.01, 0.1, 2.0} (discounted cumulative reward).

Figure 3 reports comparisons of the other methods without line-search. We chose the best step-size for each algorithm to ensure its best performance (AC and RKHS-AC use 0.005; eNAC and RKHS-NAC use 0.1). An arbitrary choice of step-size may yield instability and convergence to poor local optima, which is not a problem for the EM-inspired methods. All RKHS methods perform comparably well and slightly better than AC and eNAC. PoWER improves slowly but very stably.

Figure 3: Discounted cumulative reward on (left) the Toy domain and (right) the Inverted Pendulum, comparing AC, eNAC, PoWER, RKHS-AC, RKHS-NAC, and RKHS-PoWER.

### 4.2 Inverted Pendulum

This task [Lever and Stafford, 2015] has an action domain [-3, 3] and a state s = (θ, ω) consisting of the angular position θ in [-π, π] and the angular velocity ω in [-4π, 4π]. The initial state is always s_0 = (π, 0). The reward function is r(s, a) = exp(-0.5 θ²), and the dynamics is θ' = θ + 0.05ω + ε, ω' = ω + 0.05a + 2ε, where ε is small Gaussian noise with a standard deviation of 0.02. Each transition applies the above dynamics twice (with the same action value). We use N = 10 and H = 100; the optimal return is roughly 46. We use 40 RBF centres in all controllers. The averaged performance is obtained over 15 runs. If the pendulum is balanced, the return should be at least 25.

All methods use carefully selected step-sizes to achieve their best performance. The mean performance and its first standard deviation for all algorithms are reported in Fig. 3. While RKHS-AC, AC, and PoWER are still struggling to find the optimal policy after 200 iterations, RKHS-NAC, RKHS-PoWER, and eNAC achieve a return over 25 after 20 iterations. The performance of RKHS-NAC and eNAC is very promising in this domain, coming very close to the optimal performance.

### 4.3 POMDP Inverted Pendulum

The third task is the POMDP Inverted Pendulum. We keep the original dynamics and reward function and add an observation function to create a POMDP learning task. We assume that the agent can only observe a noisy angular position o_t = θ_t + 2ε. We use 150 centres for all algorithms, line-search over a grid of 50 step-sizes, and a history length of 20. We use eNAC as the best parametric contender; its state space consists of 3-step histories, with 4 centres on each state dimension (46 features).

Figure 5: Results for (left) the POMDP inverted pendulum and (right) the simulated robotic manipulation task.

Figure 5 reports the average performance over 10 runs and its first standard deviation. RKHS-NAC performs best: this task has a very large history state space in which the bandwidth setting is very important, and RKHS-NAC is covariant under the coordinate transformations induced by changes of the bandwidth.
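For reference, here is a sketch of the simulated dynamics used in Sections 4.1-4.3, reconstructed from the descriptions above; the clipping of the toy state to [-4, 4], the angle wrapping, the evaluation of the reward at the successor state, and the exact handling of the observation noise are our assumptions.

```python
import numpy as np

def toy_step(s, a):
    """Toy domain (Sec. 4.1): s in [-4, 4], a in [-1, 1], s0 = 0,
    r(s, a) = exp(-|s - 3|), s' = s + a + eps with small Gaussian noise.
    The noise scale and the state clipping are assumptions."""
    s_next = np.clip(s + a + 0.05 * np.random.randn(), -4.0, 4.0)
    return s_next, np.exp(-abs(s_next - 3.0))

def pendulum_step(theta, omega, a, noise_std=0.02):
    """Inverted pendulum (Sec. 4.2): theta' = theta + 0.05*omega + eps,
    omega' = omega + 0.05*a + 2*eps, applied twice per transition with
    the same action value."""
    for _ in range(2):
        theta = theta + 0.05 * omega + noise_std * np.random.randn()
        omega = omega + 0.05 * a + 2.0 * noise_std * np.random.randn()
        theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap angle (assumption)
        omega = np.clip(omega, -4 * np.pi, 4 * np.pi)
    return theta, omega, np.exp(-0.5 * theta ** 2)

def pomdp_observe(theta, noise_std=0.02):
    """POMDP variant (Sec. 4.3): only a noisy angle o_t = theta_t + 2*eps is observed."""
    return theta + 2.0 * noise_std * np.random.randn()
```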
### 4.4 Robotic Manipulation Task

Figure 4: The PR2 robot opening a door.

This task is a door-opening task on a simulated PR2 robot (using the ODE physics simulator internally) and compares only the three RKHS methods. We give a demonstration via kinesthetic teaching on a physical Willow Garage PR2 platform; the setup is shown in Fig. 4. As action space we define a two-dimensional space that determines the gripper placement on the door handle: the first action dimension is the horizontal position of the finger on the door handle, and the second is the finger opening width. Both actions transform the demonstration trajectory, and the resulting trajectory is then executed in the simulator to return a reward value. The reward function is $e^{-(R_1 + R_2)}$ and consists of two parts: $R_1$ measures how close the current trajectory is to the demonstration, and $R_2$ is a binary penalty cost indicating whether the door was opened successfully. We set N = 10 and H = 1.

Figure 5 reports the average performance over 5 runs with a fixed step-size α = 0.005 for RKHS-AC and RKHS-NAC. Each run starts from a different initial pose of the robot. All three algorithms make progress; RKHS-PoWER improves more slowly but appears more stable.

## 5 Conclusion

We have proposed a new policy search framework that enables policy modeling in reproducing kernel Hilbert space. The main insight is a derivation that makes the previous RKHS policy gradient framework a special case and additionally yields two new algorithms: a natural actor-critic in RKHS and an EM-inspired policy search in RKHS. While RKHS-NAC is a fix for the non-covariant RKHS-AC algorithm, RKHS-PoWER mitigates the step-size problem. The experimental results are very promising, owing to the covariant behaviour and the new application track in POMDPs. The covariant behaviour seems to play a very important role in RKHS methods, because setting the bandwidth is not as easy as in parametric methods, especially in very large problems.

## Acknowledgment

This work was supported by the DFG (German Science Foundation) under grant number 409/9-1 within the priority program Autonomous Learning (SPP 1527) and the EU-ICT project 3rd Hand (610878). We would like to thank Guy Lever (UCL) for fruitful discussions.

## References

[Amari, 1998] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, February 1998.

[Bagnell and Schneider, 2003a] J. Andrew Bagnell and Jeff Schneider. Policy search in reproducing kernel Hilbert space. Technical Report CMU-RI-TR-03-45, Robotics Institute, Pittsburgh, PA, November 2003.

[Bagnell and Schneider, 2003b] J. Andrew Bagnell and Jeff G. Schneider. Covariant policy search. In IJCAI, pages 1019-1024, 2003.

[Baxter and Bartlett, 2001] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. (JAIR), 15:319-350, 2001.

[Boots et al., 2013] Byron Boots, Geoffrey J. Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. In UAI, 2013.

[Kakade, 2001] Sham Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS), pages 1531-1538, 2001.

[Kober and Peters, 2011] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171-203, 2011.

[Kober et al., 2010] Jens Kober, Erhan Oztop, and Jan Peters. Reinforcement learning to adjust robot movements to new situations. In Robotics: Science and Systems VI, 2010.

[Kober et al., 2013] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. I. J. Robotic Res., 32(11):1238-1274, 2013.
[Konda and Tsitsiklis, 1999] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1008-1014, 1999.

[Lever and Stafford, 2015] Guy Lever and Ronnie Stafford. Modelling policies in MDPs in reproducing kernel Hilbert space. In AISTATS, 2015.

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.

[Micchelli and Pontil, 2005] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177-204, 2005.

[Neumann, 2011] Gerhard Neumann. Variational inference for policy search in changing situations. In ICML, pages 817-824, 2011.

[Nordebo et al., 2010] Sven Nordebo, Andreas Fhager, Mats Gustafsson, and Börje Nilsson. Fisher information integral operator and spectral decomposition for inverse scattering problems. Inverse Problems in Science and Engineering, 18(8), 2010.

[Peters and Schaal, 2007] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In ICML, pages 745-750, 2007.

[Peters and Schaal, 2008] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180-1190, 2008.

[Reed and Simon, 2003] Michael Reed and Barry Simon. Functional Analysis. Elsevier, 2003.

[Schölkopf and Smola, 2002] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series, MIT Press, 2002.

[Silver et al., 2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic policy gradient algorithms. In ICML, pages 387-395, 2014.

[Sutton et al., 1999] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1057-1063, 1999.

[van Hoof et al., 2015] Herke van Hoof, Jan Peters, and Gerhard Neumann. Learning of non-parametric control policies with high-dimensional state features. In AISTATS, 2015.

[Vincent and Bengio, 2002] Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165-187, 2002.

[Vlassis et al., 2009] Nikos Vlassis, Marc Toussaint, Georgios Kontes, and Savas Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Auton. Robots, 27(2):123-130, 2009.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.