# Policy Search in Reproducing Kernel Hilbert Space

Ngo Anh Vien, Peter Englert and Marc Toussaint
Machine Learning and Robotics Lab, University of Stuttgart, Germany
{vien.ngo, peter.englert, marc.toussaint}@ipvs.uni-stuttgart.de

## Abstract

Modeling policies in reproducing kernel Hilbert space (RKHS) renders policy gradient reinforcement learning algorithms non-parametric. As a result, the policies become very flexible and have a rich representational potential without a predefined set of features. However, their updates might be either non-covariant under reparameterization of the chosen kernel or very sensitive to step-size selection. In this paper, we propose to use a general framework to derive a new RKHS policy search technique. The new derivation leads to both a natural RKHS actor-critic algorithm and an RKHS expectation-maximization (EM) policy search algorithm. Further, we show that kernelization enables us to learn in partially observable (POMDP) tasks, which is considered daunting for parametric approaches. Via sparsification, a small set of support vectors representing the history is shown to be discovered effectively. For evaluation, we use three simulated (PO)MDP reinforcement learning tasks and a simulated PR2 robotic manipulation task. The results demonstrate the effectiveness of the new RKHS policy search framework in comparison to plain RKHS actor-critic, episodic natural actor-critic, plain actor-critic, and PoWER approaches.

## 1 Introduction

Policy search methods in reinforcement learning (RL) have recently been shown to be very effective in very high-dimensional robotics problems [Williams, 1992; Kober et al., 2013]. In such applications, the policy is compactly represented in a parametric space and effectively optimized using gradient descent [Sutton et al., 1999]. Theoretically, policy gradient methods have strong convergence guarantees. In larger problems, their performance and computation can be further enhanced when combined with function approximation, which may be linear [Kober et al., 2013] or non-linear [Silver et al., 2014; Lillicrap et al., 2015]. Nonetheless, parametric approaches require a careful design of the features, which is inflexible in practical use and inefficient in large problems. One can overcome this inflexibility and limited adaptability by using nonparametric techniques, in particular by modeling the policy in an RKHS (reproducing kernel Hilbert space) of functions [Bagnell and Schneider, 2003a; Lever and Stafford, 2015; van Hoof et al., 2015]. Moreover, an adaptively compact policy representation can be achieved via sparsification in the RKHS. For multi-dimensional actions, a vector-valued kernel can be used [Micchelli and Pontil, 2005].

However, standard policy gradient methods might face difficult convergence problems. The issue stems either from using a non-covariant gradient or from selecting an ad-hoc step-size. The latter is a standard problem in any gradient-based method and is classically addressed using line-search, which may be sample-inefficient. The step-size selection problem can be avoided by using EM-inspired techniques instead [Vlassis et al., 2009; Kober and Peters, 2011]. Using a non-covariant gradient means that the gradient direction might change significantly even under a simple re-parameterization of the policy representation.
Fortunately, the inherently non-covariant behaviour due to coordinate transformations [Amari, 1998] can be fixed by using the natural gradient with respect to one particular Riemannian metric, namely the natural metric or Fisher information metric [Kakade, 2001; Bagnell and Schneider, 2003b; Peters and Schaal, 2008]. The EM-inspired and policy gradient algorithms have further been shown to be special cases of a common policy search framework based on free-energy minimization [Kober and Peters, 2011; Neumann, 2011].

Inspired by this general derivation, in this paper we propose a policy search framework in RKHS which results in two RKHS policy search algorithms: RKHS natural actor-critic and RKHS EM-inspired policy search. We derive the following results:

- The natural gradient in RKHS is computed with respect to the Fisher information metric, which is a by-product of considering the path-distribution manifold in the policy update. We show that plain functional policy gradients are non-covariant under changes of the kernel's hyperparameters, e.g. the variance of RBF kernels or the free parameters of polynomial kernels. We prove that our functional natural policy gradient is a covariant update rule. As a result, we design a new natural actor-critic in the RKHS framework, called RKHS-NAC. Using a simple example, we show that RKHS-AC requires a very careful choice of the kernel's hyperparameters, as this drastically affects the performance and the quality of the converged policy. By contrast, RKHS-NAC frees us from that painful problem.
- The RKHS EM-inspired algorithm can be considered a kernelized PoWER (Policy Learning by Weighting Exploration with the Returns). Both the mean and variance functionals of the policy are updated analytically based on a notion of bounds on policy improvements.
- The general RKHS policy search framework turns out to be an efficient technique for learning in POMDPs, where the policy is modeled on a sparse set of observed histories. This nonparametric representation renders our RKHS policy search algorithms a promising approach for POMDP learning, which is daunting for parametric ones.

We compare our new framework to the recently introduced RKHS actor-critic framework (RKHS-AC) [Lever and Stafford, 2015] and to parametric approaches such as the episodic natural actor-critic [Peters and Schaal, 2008], the actor-critic framework [Konda and Tsitsiklis, 1999], and PoWER [Kober and Peters, 2011]. The comparisons are done on benchmark RL problems in both MDP and POMDP environments, and on a robotic manipulation task implemented on a simulated PR2 platform.

## 2 Background

In this section, we briefly present the background that will be used in the rest of the paper.

### 2.1 Markov Decision Process

A Markov decision process (MDP) is defined as a 5-tuple $\{S, A, T, R, \gamma\}$, where $S$ is the state space, $A$ the action space, $T$ the transition function, $R$ the reward function, and $\gamma \in (0, 1)$ a discount factor.

In the parametric setting, the controller is a Gaussian policy whose mean is a linear function of features, i.e. $a = \theta^{\top}\phi(s) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(\varepsilon; 0, \Sigma)$. For brevity, we assume that the variance $\Sigma$ is a diagonal matrix with entries $\sigma_i^2$. In the EM-inspired (PoWER-style) approach, the policy is improved by computing the next parameter value $\theta'$ that makes $\nabla_{\theta'} L(\theta'; \theta)$ equal to zero. As a result, the update takes the form of exploration weighted by the returns, i.e. for the $i$-th dimension

$$\theta'_i = \theta_i + \frac{E\big[\sum_t \varepsilon_{t,i}\, A^{\pi}(s_t, a_t)\big]}{E\big[\sum_t A^{\pi}(s_t, a_t)\big]}.$$

Taking the derivative of the lower bound w.r.t. $\sigma_i$ yields a similar update expression for the variance $\Sigma'$.
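To make this reward-weighted EM update concrete, here is a minimal Python sketch of a closed-form, reward-weighted-regression variant of the update for the Gaussian policy $a = \theta^{\top}\phi(s) + \varepsilon$ (in the spirit of [Peters and Schaal, 2007]). It is an illustration under our own simplifications (ridge regularizer `lam`, non-negative advantage weights, batched arrays), not the authors' implementation.

```python
import numpy as np

def reward_weighted_update(phi_all, a_all, adv_all, lam=1e-6):
    """Reward-weighted regression form of the EM policy update for a Gaussian
    policy a = theta^T phi(s) + eps: maximize the advantage-weighted
    log-likelihood in closed form.

    phi_all: (T, d) stacked features phi(s_t)
    a_all:   (T, n) stacked executed actions a_t
    adv_all: (T,)   advantage (or return) estimates A(s_t, a_t), assumed >= 0
    returns: theta' of shape (d, n) and per-dimension noise variances sigma'^2
    """
    W = np.diag(adv_all)
    d = phi_all.shape[1]
    # theta' = (Phi^T W Phi + lam I)^-1 Phi^T W A  (weighted ridge regression)
    theta_new = np.linalg.solve(phi_all.T @ W @ phi_all + lam * np.eye(d),
                                phi_all.T @ W @ a_all)
    # variance update: advantage-weighted mean of squared residuals per dimension
    resid = a_all - phi_all @ theta_new
    sigma2_new = (adv_all[:, None] * resid ** 2).sum(axis=0) / adv_all.sum()
    return theta_new, sigma2_new
```

The variance estimate mirrors the per-dimension update mentioned above: each $\sigma'^2_i$ is the advantage-weighted mean of the squared residuals in that action dimension.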
### 2.3 Actor-Critic in RKHS

As recently introduced by [Bagnell and Schneider, 2003a; Lever and Stafford, 2015], the policies are parameterized by functionals in reproducing kernel Hilbert spaces. More specifically, the functional $h(s)$ in Eq. 2 is an element of a vector-valued RKHS $\mathcal{H}_K$,

$$h(\cdot) = \sum_i K(s_i, \cdot)\,\alpha_i, \qquad \alpha_i \in A \ (\subset \mathbb{R}^n), \qquad (7)$$

where the kernel $K$ is defined as a mapping $K: S \times S \to \mathcal{L}(A)$ [Micchelli and Pontil, 2005], and $\mathcal{L}(A)$ is the space of linear operators on $A$. The simplest choice of $K$ is $K(s, s') = \kappa(s, s')\, I_n$, where $I_n$ is an $n \times n$ identity matrix and $\kappa$ is a scalar-valued kernel [Schölkopf and Smola, 2002]. The functional gradient, i.e. the Fréchet derivative, of the term $\log \pi_h(a_t|s_t)$ is

$$\nabla_h \log \pi_h(a_t|s_t) = K(s_t, \cdot)\,\Sigma^{-1}\big(a_t - h(s_t)\big) \in \mathcal{H}_K. \qquad (8)$$

Therefore, the gradient of the performance measure w.r.t. $h$, obtained via the likelihood-ratio method, is approximated from a set of $N$ sampled trajectories $\{\tau_i\}$ of $H$ steps as

$$\nabla_h U(\pi_h) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_h \log P(\tau_i; h)\, R(\tau_i) = \frac{1}{N}\sum_{i=1}^{N}\sum_t A^{\pi}(s^i_t, a^i_t)\, K(s^i_t, \cdot)\,\Sigma^{-1}\big(a^i_t - h(s^i_t)\big). \qquad (9)$$

As a result, $\nabla_h U(\pi_h) \in \mathcal{H}_K$ and the functional gradient update at iteration $k$ is

$$h_{k+1} = h_k + \alpha\,\nabla_h U(\pi_{h_k}), \qquad (10)$$

where $\alpha$ is a step-size. After each gradient update, the number of centres $s_t$ of the next $h_{k+1}$ might increase rapidly; a sparse representation can be attained via sparsification. In addition, Lever and Stafford showed that a compatible kernel $K_h$ for approximating $A^{\pi}(s, a)$ can be constructed from $K$ as

$$K_h\big((s, a), (s', a')\big) = \big(\Sigma^{-1}(a - h(s))\big)^{\top} K(s, s')\,\Sigma^{-1}\big(a' - h(s')\big). \qquad (11)$$

Figure 1: Discounted cumulative reward in the Toy domain with polynomial and RBF kernels. The horizon H is set to 20. We report the mean performance averaged over 50 runs and its 95% confidence interval as shaded areas.

As a result of compatible function approximation, there exists a feature space $\mathcal{F}_\phi \subset \mathcal{H}_K$ that contains $A^h(s, a)$ and the weight $w$, satisfying

$$A^h(s, a) = \big\langle w, \nabla_h \log \pi(a|s)\big\rangle_{\mathcal{F}_\phi} = \big\langle w, K(s, \cdot)\,\Sigma^{-1}(a - h(s))\big\rangle_{\mathcal{H}_K}.$$

There are a number of ways to regress $A^h(s, a)$ from data generated by $\pi_h$. Lever and Stafford [2015] used kernel matching pursuit [Vincent and Bengio, 2002] for both function regression and sparsification. Accordingly, they proposed the compatible RKHS actor-critic framework, RKHS-AC.

However, the RKHS-AC gradient update is not covariant w.r.t. changes of the kernel's hyperparameters. We show an example use of RKHS-AC on the toy problem in Fig. 1 (see the experiment section for the problem description). We use two types of kernel to represent the functional policy in Eq. 2 with different hyperparameter values: a polynomial kernel $\kappa(s, s') = (s^{\top}s' + c)^d$ with $d = 5$ and $c \in \{0.01, 0.2, 1.0\}$, and an RBF kernel $\kappa(s, s') = \exp(-\|s - s'\|^2/(2\,\mathrm{var}))$ with $\mathrm{var} \in \{0.01, 0.1, 2.0\}$. Note that changing the hyperparameters $c$ and $\mathrm{var}$ amounts to a transformation of the coordinate system of the same corresponding feature space. The results show that RKHS-AC is non-covariant: it does not guarantee good improvement in all settings. For instance, in the cases $c = 0.01$ and $\mathrm{var} = 2.0$ some features are very peaky, which makes the policy overwhelmingly exploit non-optimal actions at an early stage of learning.
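Before moving to the natural gradient, the following is a minimal Python sketch of an RKHS Gaussian policy with the plain functional gradient update of Eqs. 7-10, assuming the scalar-kernel case $K(s, s') = \kappa(s, s')\,I_n$ and a fixed isotropic covariance $\sigma^2 I$. The class layout, the `(s, a, advantage)` trajectory format, and the omission of sparsification are our own simplifications, not the code of [Lever and Stafford, 2015].

```python
import numpy as np

class RKHSPolicy:
    """Gaussian policy a ~ N(h(s), sigma^2 I) with h(.) = sum_i kappa(c_i, .) alpha_i
    (Eq. 7), using the simplest kernel choice K(s, s') = kappa(s, s') I_n."""

    def __init__(self, action_dim, sigma=0.5, bandwidth=1.0):
        self.centres = []            # states c_i acting as kernel centres
        self.alphas = []             # coefficient vectors alpha_i in R^n
        self.sigma = sigma           # exploration standard deviation
        self.bw = bandwidth          # RBF bandwidth (e.g. set by the median trick)
        self.n = action_dim

    def kernel(self, s1, s2):
        d = np.atleast_1d(np.asarray(s1, dtype=float) - np.asarray(s2, dtype=float))
        return np.exp(-np.dot(d, d) / (2.0 * self.bw))

    def h(self, s):
        """Evaluate the mean functional h(s) = sum_i kappa(c_i, s) alpha_i."""
        out = np.zeros(self.n)
        for c, a in zip(self.centres, self.alphas):
            out += self.kernel(c, s) * a
        return out

    def sample(self, s):
        return self.h(s) + self.sigma * np.random.randn(self.n)

    def plain_gradient_step(self, trajectories, step):
        """Plain functional gradient update (Eqs. 8-10): every visited state becomes
        a new centre with coefficient step * A(s,a) * (a - h(s)) / sigma^2,
        computed with the old functional h."""
        new = []
        for traj in trajectories:                    # traj: list of (s, a, advantage)
            for (s, a, adv) in traj:
                coef = step * adv * (np.asarray(a) - self.h(s)) / self.sigma ** 2
                new.append((s, coef))
        for s, coef in new:
            self.centres.append(s)
            self.alphas.append(coef)
        # in practice h is then sparsified, e.g. by kernel matching pursuit
```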
## 3 Policy Search in RKHS

Inspired by the derivation of the parametric policy search framework, we derive a new policy search framework which enables policy modeling in RKHS. Assume that the controller is a Gaussian policy $\pi_h(a|s) = \mathcal{N}(a; h(s), \Sigma(s))$, where $a \in \mathbb{R}^n$ and $h \in \mathcal{H}_K$.

### 3.1 Natural Policy Gradient in RKHS

Let $G$ denote the Fisher information operator on $\mathcal{H}_K$ (Eq. 16), obtained by considering the path-distribution manifold. The operator $G^*G$ is bounded, i.e. there exists $M > 0$ such that $\|G^*G h\| \le M \|h\|$. From those results on $G$ and $G^*G$, we can derive

$$\langle Gh, Gh\rangle = \langle h, G^*G h\rangle \le \|h\|\,\|G^*G h\| \le M\,\|h\|^2,$$

which means the linear operator $G$ is also bounded. From a set of $N$ trajectories $\{\tau_i\}$, $G$ can be estimated as

$$G \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_h \log P(\tau_i; h)\,\nabla_h \log P(\tau_i; h)^{\top} = \frac{1}{N}\,\Phi(\tau)\,\Phi(\tau)^{\top},$$

where $\Phi(\tau) = [\nabla_h \log P(\tau_1; h), \ldots, \nabla_h \log P(\tau_N; h)]$. Using Lagrange multipliers for the optimization problem in (15), the natural functional gradient is

$$\nabla^{\mathrm{NG}}_h U(\pi_h) = G^{-1}\,\nabla_h U(\pi_h) \approx \Big(\tfrac{1}{N}\Phi(\tau)\Phi(\tau)^{\top} + \lambda I\Big)^{-1}\tfrac{1}{N}\,\Phi(\tau)\,R, \qquad (18)$$

where $R = [R(\tau_1), \ldots, R(\tau_N)]^{\top}$ and $\lambda$ is a regularizer. Using the Woodbury identity for Eq. 18, we obtain

$$\nabla^{\mathrm{NG}}_h U(\pi_h) \approx \Phi(\tau)\big(\Phi(\tau)^{\top}\Phi(\tau) + \lambda N I\big)^{-1}\hat{R} = \sum_{i=1}^{N}\Big[\sum_t K(s^i_t, \cdot)\,\Sigma^{-1}\big(a^i_t - h(s^i_t)\big)\Big]\big[(K_\tau + \lambda N I)^{-1}\hat{R}\big]_i,$$

where $K_\tau = \Phi(\tau)^{\top}\Phi(\tau)$ is the Gram matrix of trajectories, built from the trajectory kernel function $K_\tau(\tau_i, \tau_j)$, and $\hat{R} = R$. We now define the scalar-valued trajectory kernel $K_\tau(\tau_i, \tau_j)$. Note that the gradient of the log trajectory distribution is

$$\nabla_h \log P(\tau_i; h) = \sum_t K(s^i_t, \cdot)\,\Sigma^{-1}\big(a^i_t - h(s^i_t)\big), \qquad (19)$$

hence

$$K_\tau(\tau_i, \tau_j) = \big\langle \nabla_h \log P(\tau_i; h),\, \nabla_h \log P(\tau_j; h)\big\rangle = \sum_{t, t'} K_h\big((s^i_t, a^i_t), (s^j_{t'}, a^j_{t'})\big),$$

with $K_h$ defined in Eq. 11. The resulting trajectory kernel is thus a sum of compatible kernel evaluations over all visited state-action pairs. As a sum of proper kernels, the trajectory kernel is itself a proper kernel and specifies an RKHS $\mathcal{H}_{K_\tau}$ of trajectory feature maps. Finally, the plain natural policy gradient in RKHS, reported in Algorithm 1, has the following update:

$$h_{t+1} = h_t + \alpha\,\nabla^{\mathrm{NG}}_h U(\pi_h). \qquad (20)$$

### 3.2 Episodic Natural Actor-Critic in RKHS

We now use the above results to propose a new episodic natural actor-critic framework that enables policy modeling in RKHS.

Algorithm 1: RKHS Policy Search Framework
  Given the kernel K, initialize h_0 = 0
  for k = 0, 1, 2, ... do
    Sample N trajectories {tau_i} from pi_{h_k}
    Estimate A^pi(s, a) using {tau_i}_{i=1}^N to obtain w in H_K
    RKHS-NPG:   h_{k+1} = h_k + alpha * grad^NG_h U(pi_{h_k})
    RKHS-NAC:   h_{k+1} = h_k + alpha_k * w
    RKHS-PoWER: update h_{k+1}, g_{k+1} as in Eq. 24
    Sparsify h_{k+1} in H_K
  end for

The Actor Update
First, we rewrite the Fisher information operator in Eq. 16 by substituting the definition of $\nabla_h \log P(\tau)$ in Eq. 5 to obtain

$$G = E_{P(\tau; h)}\Big[\sum_t \nabla_h \log \pi(a_t|s_t)\,\nabla_h \log \pi(a_t|s_t)^{\top}\Big], \qquad (21)$$

where we use the fact that $\int \pi(a|s)\, da = 1$ twice. On the other hand, substituting $A(s, a)$ from Eq. 12 into Eq. 9 yields

$$\nabla_h U(\pi_h) = E_{P(\tau; h)}\Big[\sum_t \big\langle w, \nabla_h \log \pi(a_t|s_t)\big\rangle\,\nabla_h \log \pi(a_t|s_t)\Big] = E_{P(\tau; h)}\Big[\sum_t \nabla_h \log \pi(a_t|s_t)\,\nabla_h \log \pi(a_t|s_t)^{\top}\Big]\, w = G\,w.$$

As a result, the natural gradient in RKHS used in the actor update is

$$\nabla^{\mathrm{NG}}_h U(\pi_h) = G^{-1} G\, w = w \in \mathcal{H}_K. \qquad (22)$$

The Critic Evaluation
Similar to eNAC [Peters and Schaal, 2008], we write the entire rollout of Bellman equations for an episode and insert the definition of $A^{\pi}(s_t, a_t)$ from Eq. 12 to get

$$\sum_{t=0}^{H} \gamma^t r(s_t, a_t) = \sum_{t=0}^{H} \gamma^t A^{\pi}(s_t, a_t) + V^{\pi}(s_0) = \sum_{t=0}^{H} \gamma^t \big\langle w,\, K(s_t, \cdot)\,\Sigma^{-1}(a_t - h(s_t))\big\rangle + V^{\pi}(s_0),$$

where $V^{\pi}(s_0)$ is the value function of the starting state. One could use another kernel to estimate $V^{\pi}(s_0)$. We propose to use the kernel matching pursuit algorithm [Vincent and Bengio, 2002] to regress $A^{\pi}(s, a)$ from a set of trajectories sampled from the policy $\pi_h$. Putting all ingredients together, we derive the RKHS natural actor-critic framework (RKHS-NAC), as described in Algorithm 1. The algorithm uses the kernel matching pursuit method to sparsify the new functional $h_{k+1}$.
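The RKHS-NAC actor update simply replaces the plain gradient by the critic weight $w$, but the RKHS-NPG variant needs the trajectory-kernel computation of Eqs. 18-20 explicitly. Below is a sketch of that computation in Python, reusing the RKHSPolicy class from the previous listing; using the summed per-step advantages as the trajectory scores $R(\tau_i)$ and a plain ridge regularizer are our assumptions.

```python
import numpy as np

def natural_gradient_step(policy, trajectories, step, lam=1e-3):
    """RKHS natural policy gradient update (Eqs. 18-20), scalar-kernel case.
    The natural gradient Phi (Phi^T Phi + lam*N*I)^-1 R is represented by one
    coefficient beta_i per trajectory; Phi^T Phi is the trajectory Gram matrix
    built from the compatible kernel K_h of Eq. 11."""
    N = len(trajectories)
    sig4 = policy.sigma ** 4

    def compat_kernel(s, a, s2, a2):
        # K_h((s,a),(s',a')) with K = kappa*I and Sigma = sigma^2*I
        return policy.kernel(s, s2) * np.dot(np.asarray(a) - policy.h(s),
                                             np.asarray(a2) - policy.h(s2)) / sig4

    gram = np.zeros((N, N))               # trajectory Gram matrix K_tau
    for i, ti in enumerate(trajectories):
        for j, tj in enumerate(trajectories):
            gram[i, j] = sum(compat_kernel(s, a, s2, a2)
                             for (s, a, _) in ti for (s2, a2, _) in tj)

    # trajectory scores: summed advantages stand in for R(tau_i) here
    R = np.array([sum(adv for (_, _, adv) in t) for t in trajectories])
    beta = np.linalg.solve(gram + lam * N * np.eye(N), R)

    # h <- h + step * sum_i beta_i * grad_h log P(tau_i; h)   (Eq. 19)
    new = []
    for b, traj in zip(beta, trajectories):
        for (s, a, _) in traj:
            new.append((s, step * b * (np.asarray(a) - policy.h(s)) / policy.sigma ** 2))
    for s, coef in new:
        policy.centres.append(s)
        policy.alphas.append(coef)
```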
### 3.3 EM-Inspired Policy Search in RKHS

The major problem of policy gradient approaches like RKHS-AC and RKHS-NAC is the step-size $\alpha_k$. Inspired by PoWER, we propose an alternative solution that maximizes the lower bound $L(h'; h)$ by setting its derivative $\nabla_{h'} L(h'; h)$ to zero and solving for $h'$. Setting

$$\nabla_{h'} L(h'; h) = E_{P(\tau; h)}\Big[\sum_t A^{\pi}(s_t, a_t)\, K(s_t, \cdot)\,\Sigma^{-1}(s_t)\,\big(a_t - h'(s_t)\big)\Big] = 0$$

and solving for $h'$ yields

$$h' = \Big(E_{P(\tau; h)}\Big[\sum_t A^{\pi}(s_t, a_t)\, K(s_t, \cdot)\,\Sigma^{-1}(s_t)\, K^{\top}(s_t, \cdot)\Big]\Big)^{-1} E_{P(\tau; h)}\Big[\sum_t A^{\pi}(s_t, a_t)\, K(s_t, \cdot)\,\Sigma^{-1}(s_t)\, a_t\Big],$$

where we exploited the reproducing property $h'(s) = \langle h', K(s, \cdot)\rangle$. The update rule for the variance functionals $\Sigma(\cdot)$, written per dimension as $g_i(\cdot)$, is derived similarly:

$$g'_i(\cdot) = \frac{E_{P(\tau; h)}\big[\sum_t A^{\pi}(s_t, a_t)\, K_i(s_t, \cdot)\, K_i(s_t, s_t)^{-1}\, \varepsilon_{t,i}^2\big]}{E_{P(\tau; h)}\big[\sum_t A^{\pi}(s_t, a_t)\big]} \quad \text{(for the $i$-th dimension)},$$

where $\varepsilon_{t,i} = a_{t,i} - h_i(s_t)$.

For an example of how to estimate $h'$ and $\Sigma'$ given a set of $N$ trajectories $\{(s_{j,t}, a_{j,t})\}_{j=1..N,\, t=0..H}$ sampled from $\pi_h$, we assume $K(s, s') = \kappa(s, s')\, I_n$ for simplicity, where $\kappa$ is a scalar-valued kernel. The estimates for the $i$-th dimension are¹

$$h'_i(\cdot) = \sum_{j,t} \kappa(s_{j,t}, \cdot)\,\beta_{i,(j,t)} = \Phi\,\beta_i, \qquad \beta_i = \big(\lambda A_1^{-1} + \Gamma\big)^{-1} A_2,$$

where we denote $\Phi = \big[\kappa(s_{1,0}, \cdot), \ldots, \kappa(s_{N,H}, \cdot)\big]$, $A_1 = \mathrm{diag}\big(A(s_{1,0}, a_{1,0})/g_i(s_{1,0}), \ldots, A(s_{N,H}, a_{N,H})/g_i(s_{N,H})\big)$, $A_2 = (a_{1,0,i}, \ldots, a_{N,H,i})^{\top}$, and $\Gamma = \Phi^{\top}\Phi$ is a Gram matrix with $\Gamma_{jk} = \kappa(s_j, s_k)$ (the constant $\lambda$ is a regularizer)². The variance estimate is

$$g'_i = \frac{\sum_{j,t} A(s_{j,t}, a_{j,t})\,\varepsilon_{j,t,i}^2\,\kappa(s_{j,t}, s_{j,t})^{-1}}{\sum_{k,l} A(s_{k,l}, a_{k,l})}.$$

¹ The results are based on the use of the Woodbury matrix identity.
² $\beta_i$ is an $(N \cdot H) \times 1$ vector, reshaped to an $N \times H$ matrix.

The overall algorithm, named RKHS-PoWER, is reported in Algorithm 1.

### 3.4 Policy Search in POMDP Environment

One of the very interesting characteristics of the RKHS policy search framework is learning in POMDP environments. POMDP learning requires the action selection at time $t$, $\mu(a_t|h_t)$, to depend on the history $h_t = \{o_0, a_0, \ldots, o_t\}$, where $o_t$ is an observation. Feature design in POMDP learning is elusive for standard parametric policy search approaches, as the history space is very large and high-dimensional. It is common for parametric methods to put regular features on each state dimension. Hence parametric methods suffer not only from the curse of dimensionality (the number of features is exponential in the number of state dimensions) but also from the curse of history (the number of features is exponential in the length of the history). In contrast, our RKHS policy search approaches discover history features adaptively and sparsely. One may find this learning similar to non-controlled POMDP learning using conditional embeddings of predictive state representations [Boots et al., 2013]. In our experiments, we fix the length of $h_t$ (the number of history steps) and use an RBF kernel to measure the distance between two histories.
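As an illustration of this history-based representation, the sketch below builds fixed-length history vectors and a scalar RBF kernel over them, with the bandwidth set by the median trick used in the experiments. The zero-padding at the start of an episode, the flattening of observation-action pairs into one vector, and the function names are our assumptions.

```python
import numpy as np

def history_vector(obs, acts, length, obs_dim, act_dim):
    """Fixed-length history feature from the last `length` observation-action pairs,
    zero-padded at the start of an episode (padding convention is an assumption)."""
    parts = []
    for k in range(length, 0, -1):
        i = len(obs) - k
        o = np.asarray(obs[i], dtype=float).ravel() if i >= 0 else np.zeros(obs_dim)
        a = (np.asarray(acts[i], dtype=float).ravel()
             if 0 <= i < len(acts) else np.zeros(act_dim))
        parts.append(np.concatenate([o, a]))
    return np.concatenate(parts)

def median_bandwidth(histories):
    """Median trick: use the median pairwise squared distance as the RBF bandwidth."""
    d2 = [np.sum((x - y) ** 2)
          for i, x in enumerate(histories) for y in histories[i + 1:]]
    return float(np.median(d2)) if d2 else 1.0

def history_kernel(h1, h2, bw):
    """Scalar RBF kernel over fixed-length histories."""
    return np.exp(-np.sum((h1 - h2) ** 2) / (2.0 * bw))
```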
## 4 Experiments

We evaluate and compare on four domains (three with learning in MDPs and one with learning in a POMDP): a Toy domain, the benchmark Inverted Pendulum domain, a simulated PR2 robotic manipulation task, and the POMDP Inverted Pendulum. The compared methods are RKHS-NAC, RKHS-PoWER, RKHS-AC [Lever and Stafford, 2015], PoWER [Kober and Peters, 2011], eNAC [Peters and Schaal, 2008], and actor-critic (AC). All methods use a discount factor γ = 0.99 for learning, and we report the discounted cumulative reward. The number of centres used in the RKHS methods is comparable to the number of basis functions used in the parametric methods. All tasks use the RBF kernel, whose bandwidth is chosen using the median trick.

### 4.1 Toy Domain

The Toy domain was introduced by Lever and Stafford [2015]. It has a state space S = [-4.0, 4.0], an action space A = [-1.0, 1.0], the starting state s_0 = 0, and a reward function r(s, a) = exp(-|s - 3|). The dynamics is s_{t+1} = s_t + a_t + ε, where ε is small Gaussian noise. We set N = 10 and H = 20. All controllers use 20 centres. The optimal value is around 14.5 [Lever and Stafford, 2015].

We first study the covariant behaviour of the natural RKHS policy gradient method, which is not observed for the standard RKHS-AC described in Section 2.3. Both algorithms use line-search over a grid of 50 choices. The results, computed over 50 runs, are depicted in Fig. 2. The consistent performance of RKHS-NAC across all settings, for both polynomial and RBF kernels, confirms that RKHS-NAC is covariant.

Figure 2: Toy domain with (left) the polynomial kernel and (right) the RBF kernel, comparing RKHS-AC and RKHS-NAC for c in {0.01, 0.2, 1.0} and var in {0.01, 0.1, 2.0} (discounted cumulative reward).

Figure 3 reports comparisons of the other methods without line-search. We chose the best step-size for each algorithm to ensure its best performance (AC and RKHS-AC use 0.005; eNAC and RKHS-NAC use 0.1). An arbitrary choice of step-size may yield instability and convergence to poor local optima, which is not a problem for the EM-inspired methods. All RKHS methods perform comparably well and slightly better than AC and eNAC. PoWER improves slowly but very stably.

Figure 3: Discounted cumulative reward on (left) the Toy domain and (right) the Inverted Pendulum, comparing AC, eNAC, PoWER, RKHS-AC, RKHS-NAC, and RKHS-PoWER.

### 4.2 Inverted Pendulum

This task [Lever and Stafford, 2015] has an action domain [-3, 3] and a state s = (θ, ω) consisting of the angular position θ in [-π, π] and the angular velocity ω in [-4π, 4π]. The initial state is always s_0 = (π, 0). The reward function is r(s, a) = exp(-0.5 θ²), and the dynamics is θ' = θ + 0.05ω + ε, ω' = ω + 0.05a + 2ε, where ε is small Gaussian noise with a standard deviation of 0.02. Each transition applies the above dynamics twice (with the same action value). We use N = 10 and H = 100; the optimal return is roughly 46. We use 40 RBF centres in all controllers. The averaged performance is obtained over 15 runs. If the pendulum is balanced, the return should be at least 25.

All methods use carefully selected step-sizes to achieve their best performance. The mean performance and its first standard deviation for all algorithms are reported in Fig. 3. While RKHS-AC, AC, and PoWER are still struggling to find the optimal policy after 200 iterations, RKHS-NAC, RKHS-PoWER, and eNAC achieve a return over 25 after 20 iterations. The performance of RKHS-NAC and eNAC is very promising in this domain, coming very close to the optimal performance.

### 4.3 POMDP Inverted Pendulum

The third task is the POMDP Inverted Pendulum. We keep the original dynamics and reward function and add an observation function to create a POMDP learning task. We assume that the agent can only observe a noisy angular position o_t = θ_t + 2ε. We use 150 centres for all algorithms, line-search over a grid of 50 step-sizes, and a history length of 20. We use eNAC as the best parametric contender; its state space consists of 3-step histories, with 4 centres on each state dimension (46 features).

Figure 5: Results for (left) the POMDP inverted pendulum and (right) the simulated robotic manipulation task.

Figure 5 reports the average performance over 10 runs and its first standard deviation. RKHS-NAC performs best: this task has a very large history state space in which the bandwidth setting is very important, and RKHS-NAC is covariant under the coordinate transformations induced by changes of the bandwidth.
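For reference, here is a sketch of the simulated dynamics used in Sections 4.1-4.3, reconstructed from the descriptions above; the clipping of the toy state to [-4, 4], the angle wrapping, the evaluation of the reward at the successor state, and the exact handling of the observation noise are our assumptions.

```python
import numpy as np

def toy_step(s, a):
    """Toy domain (Sec. 4.1): s in [-4, 4], a in [-1, 1], s0 = 0,
    r(s, a) = exp(-|s - 3|), s' = s + a + eps with small Gaussian noise.
    The noise scale and the state clipping are assumptions."""
    s_next = np.clip(s + a + 0.05 * np.random.randn(), -4.0, 4.0)
    return s_next, np.exp(-abs(s_next - 3.0))

def pendulum_step(theta, omega, a, noise_std=0.02):
    """Inverted pendulum (Sec. 4.2): theta' = theta + 0.05*omega + eps,
    omega' = omega + 0.05*a + 2*eps, applied twice per transition with
    the same action value."""
    for _ in range(2):
        theta = theta + 0.05 * omega + noise_std * np.random.randn()
        omega = omega + 0.05 * a + 2.0 * noise_std * np.random.randn()
        theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap angle (assumption)
        omega = np.clip(omega, -4 * np.pi, 4 * np.pi)
    return theta, omega, np.exp(-0.5 * theta ** 2)

def pomdp_observe(theta, noise_std=0.02):
    """POMDP variant (Sec. 4.3): only a noisy angle o_t = theta_t + 2*eps is observed."""
    return theta + 2.0 * noise_std * np.random.randn()
```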
### 4.4 Robotic Manipulation Task

Figure 4: The PR2 robot opening a door.

This task is a door-opening task on a simulated PR2 robot (using the ODE physics simulator internally) and compares only the three RKHS methods. We give a demonstration via kinesthetic teaching on a physical Willow Garage PR2 platform; the setup is shown in Fig. 4. As action space we define a two-dimensional space that determines the gripper placement on the door handle: the first action dimension is the horizontal position of the finger on the door handle, and the second is the finger opening width. Both actions transform the demonstration trajectory, and the resulting trajectory is then executed in the simulator to return a reward value. The reward function is $e^{-(R_1 + R_2)}$ and consists of two parts: $R_1$ measures how close the current trajectory is to the demonstration, and $R_2$ is a binary penalty cost indicating whether the door was opened successfully. We set N = 10 and H = 1.

Figure 5 reports the average performance over 5 runs with a fixed step-size α = 0.005 for RKHS-AC and RKHS-NAC. Each run starts from a different initial pose of the robot. All three algorithms make progress; RKHS-PoWER improves more slowly but appears more stable.

## 5 Conclusion

We have proposed a new policy search framework that enables policy modeling in reproducing kernel Hilbert space. The main insight is a derivation that makes the previous RKHS policy gradient framework a special case and additionally yields two new algorithms: a natural actor-critic in RKHS and an EM-inspired policy search in RKHS. While RKHS-NAC is a fix for the non-covariant RKHS-AC algorithm, RKHS-PoWER mitigates the step-size problem. The experimental results are very promising, owing to the covariant behaviour and the new application track in POMDPs. The covariant behaviour seems to play a very important role in RKHS methods, because setting the bandwidth is not as easy as in parametric methods, especially in very large problems.

## Acknowledgment

This work was supported by the DFG (German Science Foundation) under grant number 409/9-1 within the priority program Autonomous Learning (SPP 1527) and the EU-ICT project 3rd Hand (610878). We would like to thank Guy Lever (UCL) for fruitful discussions.

## References

[Amari, 1998] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, February 1998.

[Bagnell and Schneider, 2003a] J. Andrew Bagnell and Jeff Schneider. Policy search in reproducing kernel Hilbert space. Technical Report CMU-RI-TR-03-45, Robotics Institute, Pittsburgh, PA, November 2003.

[Bagnell and Schneider, 2003b] J. Andrew Bagnell and Jeff G. Schneider. Covariant policy search. In IJCAI, pages 1019-1024, 2003.

[Baxter and Bartlett, 2001] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. (JAIR), 15:319-350, 2001.

[Boots et al., 2013] Byron Boots, Geoffrey J. Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. In UAI, 2013.

[Kakade, 2001] Sham Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS), pages 1531-1538, 2001.

[Kober and Peters, 2011] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171-203, 2011.

[Kober et al., 2010] Jens Kober, Erhan Oztop, and Jan Peters. Reinforcement learning to adjust robot movements to new situations. In Robotics: Science and Systems VI, 2010.

[Kober et al., 2013] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. I. J. Robotic Res., 32(11):1238-1274, 2013.
[Konda and Tsitsiklis, 1999] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1008-1014, 1999.

[Lever and Stafford, 2015] Guy Lever and Ronnie Stafford. Modelling policies in MDPs in reproducing kernel Hilbert space. In AISTATS, 2015.

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.

[Micchelli and Pontil, 2005] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177-204, 2005.

[Neumann, 2011] Gerhard Neumann. Variational inference for policy search in changing situations. In ICML, pages 817-824, 2011.

[Nordebo et al., 2010] Sven Nordebo, Andreas Fhager, Mats Gustafsson, and Börje Nilsson. Fisher information integral operator and spectral decomposition for inverse scattering problems. Inverse Problems in Science and Engineering, 18(8), 2010.

[Peters and Schaal, 2007] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In ICML, pages 745-750, 2007.

[Peters and Schaal, 2008] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180-1190, 2008.

[Reed and Simon, 2003] Michael Reed and Barry Simon. Functional Analysis. Elsevier, 2003.

[Schölkopf and Smola, 2002] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series, MIT Press, 2002.

[Silver et al., 2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic policy gradient algorithms. In ICML, pages 387-395, 2014.

[Sutton et al., 1999] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1057-1063, 1999.

[van Hoof et al., 2015] Herke van Hoof, Jan Peters, and Gerhard Neumann. Learning of non-parametric control policies with high-dimensional state features. In AISTATS, 2015.

[Vincent and Bengio, 2002] Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165-187, 2002.

[Vlassis et al., 2009] Nikos Vlassis, Marc Toussaint, Georgios Kontes, and Savas Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Auton. Robots, 27(2):123-130, 2009.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.