# Sparse Latent Space Policy Search

Kevin Sebastian Luck, Arizona State University, Interactive Robotics Lab, AZ 85281 Tempe, USA, mail@kevin-luck.net
Joni Pajarinen, Aalto University, Intelligent Robotics Group, 02150 Espoo, Finland, Joni.Pajarinen@aalto.fi
Erik Berger, Technical University Bergakademie Freiberg, Institute of Computer Science, 09599 Freiberg, Germany, erik.berger@informatik.tu-freiberg.de
Ville Kyrki, Aalto University, Intelligent Robotics Group, 02150 Espoo, Finland, ville.kyrki@aalto.fi
Heni Ben Amor, Arizona State University, Interactive Robotics Lab, AZ 85281 Tempe, USA, hbenamor@asu.edu

## Abstract

Computational agents often need to learn policies that involve many control variables, e.g., a robot needs to control several joints simultaneously. Learning a policy with a high number of parameters, however, usually requires a large number of training samples. We introduce a reinforcement learning method for sample-efficient policy search that exploits correlations between control variables. Such correlations are particularly frequent in motor skill learning tasks. The introduced method uses Variational Inference to estimate policy parameters, while at the same time uncovering a low-dimensional latent space of controls. Prior knowledge about the task and the structure of the learning agent can be provided by specifying groups of potentially correlated parameters. This information is then used to impose sparsity constraints on the mapping between the high-dimensional space of controls and a lower-dimensional latent space. In experiments with a simulated bi-manual manipulator, the new approach effectively identifies synergies between joints, performs efficient low-dimensional policy search, and outperforms state-of-the-art policy search methods.

## Introduction

Reinforcement learning (RL) is a promising approach to automated motor skill acquisition (Peters et al. 2011). Instead of a human hand-coding specific controllers, an agent autonomously explores the task at hand through trial-and-error and learns the necessary movements. Yet, reinforcement learning of motor skills is also considered a challenging problem, since it requires sample-efficient learning in high-dimensional state and action spaces. A possible strategy to address this challenge can be found in the human motor control literature (Bernstein 1967). Research on human motor control provides evidence for motor synergies: joint coactivations of a set of muscles from a smaller number of neural commands. The reduction in involved parameters results in a lower-dimensional latent space for control which, in turn, reduces cognitive effort and training time during skill acquisition. The existence of synergies has been reported in a variety of human motor tasks, e.g., grasping (Santello, Flanders, and Soechting 1998), walking (Wang, O'Dwyer, and Halaki 2013), or balancing (Torres-Oviedo and Ting 2010). Recently, various synergy-inspired strategies have been put forward to improve the efficiency of RL for motor skill acquisition (Bitzer, Howard, and Vijayakumar 2010; Kolter and Ng 2007). Typically, these approaches use dimensionality reduction as a pre-processing step in order to extract a lower-dimensional latent space of control variables.
However, extracting the latent space using standard dimensionality reduction techniques requires a significantly large training set of (approximate) solutions, prior simulations, or human demonstrations. Even if such data exists, it may drastically bias the search by limiting it to the subspace of the initially provided solutions. In our previous work, we introduced an alternative approach called latent space policy search that tightly integrates RL and dimensionality reduction (Luck et al. 2014). Using an expectation-maximization (EM) framework (Dempster, Laird, and Rubin 1977), we presented a latent space policy search algorithm that iteratively refines both the estimate of the low-dimensional latent space and the policy parameters. Only samples produced during the search process were used.

In this paper, we propose a different kind of latent space policy search approach, which similarly to our previous work combines RL and dimensionality reduction, but which also allows prior structural knowledge to be included. Our method is based on the Variational Bayes (VB) framework (Neumann 2011; van de Meent et al. 2015). Variational Bayes is a Bayesian generalization of the expectation-maximization algorithm, which returns a distribution over optimal parameters instead of a single point estimate. It is a powerful framework for approximating integrals that would otherwise be intractable. Our RL algorithm exploits these properties in order to (1) perform efficient policy search, (2) infer the low-dimensional latent space of the task, and (3) incorporate prior structural information. Prior knowledge about the locality of synergies can be included by specifying distinct groups of correlated sub-components. Often such prior knowledge about groups of variables, e.g., coactivated joints and limbs, is readily available from the mechanical structure of a system. Structural prior knowledge is also common in other application domains. For example, in a wireless network the network topology defines receiver groups (Sagduyu and Ephremides 2004).

Figure 1: The main idea of Group Factor Policy Search. A number of variables, for example the joints of an arm or leg of a NAO robot, form one group. Given several such groups for the action vector (left matrix), the transformation matrix W can be divided into several submatrices corresponding to those groups. Each factor, given by a column of W, then encodes information for all groups, e.g., four in the example above. A factor may be non-zero for all groups, for a subset of groups, for exactly one group, or zero for all groups. In the figure, grey areas correspond to non-zero values and white areas to zero values in the sparse transformation matrix. The transformation matrix is multiplied by the latent variables Z = (z_1, ..., z_t, ..., z_T), distributed by z_t ~ N(0, trace(φ(s_t, t)φ(s_t, t)^T) I).

Our approach draws inspiration and incorporates ideas from Factor Analysis, in particular Group Factor Analysis (Klami et al. 2015), as can be seen in Fig. 1. Groups of variables, e.g., robot joints grouped into arms and legs, are provided as prior structural knowledge by a user. A factorized control policy is then learned through RL, which includes a transformation matrix W. The transformation matrix holds factors that describe dependencies between either all of the groups or a subset of them. The individual factors can be regarded as synergies among the joints of the robot.
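To make the grouping idea in Fig. 1 concrete, here is a minimal NumPy sketch (not from the paper) that assembles a block-sparse transformation matrix from two hypothetical groups and reports which factors are active for each group; the group sizes, number of factors, and sparsity pattern are illustrative assumptions.

```python
# Minimal NumPy sketch of the sparsity pattern in Fig. 1: W is stacked from
# per-group blocks W^(m), and each factor (column) may be active for all groups,
# a subset, one group, or none. Group sizes, the number of factors, and the
# chosen pattern are illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
groups = {"left_arm": 6, "right_arm": 6}        # D_m per group (hypothetical)
K = 4                                           # number of factors (columns of W)

W_blocks = {m: rng.normal(size=(D, K)) for m, D in groups.items()}
W_blocks["left_arm"][:, 2] = 0.0                # factor 2: right arm only
W_blocks["right_arm"][:, 1] = 0.0               # factor 1: left arm only
for m in groups:                                # factor 3: switched off for all groups
    W_blocks[m][:, 3] = 0.0

W = np.vstack(list(W_blocks.values()))          # full (12 x 4) transformation matrix
for m, block in W_blocks.items():
    active = np.flatnonzero(np.any(block != 0.0, axis=0))
    print(f"{m}: active factors {active}")      # factor 0 is shared by both groups
```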
We will show that the resulting algorithm effectively ties together prior structural knowledge, latent space identification, and policy search in a coherent way.

## Policy Search

Policy search methods try to find an optimal policy for an agent which acts in an uncertain world with an unknown world model. At each time step t the agent executes an action a_t in state s_t and moves to the next state s_{t+1} with probability p(s_{t+1} | s_t, a_t). After executing a certain number of actions, the agent receives a reward feedback given by an unknown reward function based on the performed execution trace (or trajectory/history) τ = (s_1, a_1, ..., s_T, a_T, s_{T+1}). The overall objective in policy search is to maximize the expected reward over trajectories and policy parameters θ. For bounded rewards, maximizing the expected reward is equivalent to maximizing the probability of a binary reward r (Toussaint and Storkey 2006):

$$\mathbb{E}_{\tau,\theta}[r = 1] = \iint p(\tau, \theta)\, p(r = 1 \mid \tau)\, d\theta\, d\tau, \tag{1}$$

where the probability of the trajectory p(τ, θ) contains the (stochastic) policy, r is a binary variable indicating maximum reward, and p(r = 1 | τ) ∝ exp{−c(τ)} (Toussaint 2009) is the conditional probability of receiving maximum expected reward given a cost function c(τ). Assuming the Markov property and the independence of actions, the probability of a trajectory can be written as

$$p(\tau, \theta) = p(\theta)\, p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t, \theta). \tag{2}$$

The stochastic policy π(a_t | s_t, θ) depends on the parameters θ, for which we additionally introduce prior distributions p(θ). This formulation will subsequently be used for structuring the policy model. The prior distributions may also depend on hyper-parameters; for clarity, we omit such parameters below. Furthermore, we assume that the initial state distribution p(s_1) and the transition dynamics p(s_{t+1} | s_t, a_t) are unknown but fixed. Thus, they will cancel out as constant values.
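As a brief illustration of the reward model p(r = 1 | τ) ∝ exp{−c(τ)} introduced above, the following sketch converts a few made-up trajectory costs into normalized weights, showing how lower-cost trajectories dominate exponentially; it is an aside, not part of the derivation that follows.

```python
# Sketch of the "maximum reward as a binary event" view: p(r = 1 | tau) is
# proportional to exp(-c(tau)), so low-cost trajectories receive exponentially
# larger weight. The cost values below are made up for the example.
import numpy as np

costs = np.array([4.2, 1.3, 0.7, 2.9])        # c(tau_i) for four sampled trajectories
unnormalized = np.exp(-costs)                  # p(r = 1 | tau_i), up to a constant
weights = unnormalized / unnormalized.sum()    # normalized weights over the sample set
print(weights.round(3))                        # the 0.7-cost trajectory dominates
```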
## Group Factor Policy Search

We will now introduce a new policy search method, called Group Factor Policy Search (GrouPS), that uncovers the latent space on-the-fly based on prior structural information. In this section, we discuss how to incrementally improve the policy and the actual form of the new policy model. We parameterize the policy using Group Factor Analysis (Klami et al. 2015) in order to utilize prior information about the parameters and their correlations. Since our policy is a linear stochastic model with prior distributions, we first present a novel general Variational Inference framework for policy search that takes priors into account. Subsequently, we discuss how the policy is parameterized, and finally show the policy model update equations for Group Factor Policy Search, which we derive using the introduced Variational Inference method.

### Variational Inference for Policy Search

In each iteration, our new policy search method samples a distribution over trajectories p_old(τ) using the current policy and, based on p_old(τ), finds a new policy which maximizes a lower bound on the expected reward. This is repeated until convergence. In order to find a new policy based on samples from the old one, we introduce the sampling distribution p_old(τ) and the approximated parameter distribution q(θ) (defined later) into Equation 1. By applying the log-function and using Jensen's inequality (Kober and Peters 2009; Bishop 2006, Eq. (1.115)) we derive the lower bound

$$\log \iint p_{\text{old}}(\tau)\, q(\theta)\, \frac{p(\tau, \theta)}{p_{\text{old}}(\tau)\, q(\theta)}\, p(r = 1 \mid \tau)\, d\theta\, d\tau \;\geq\; \iint p_{\text{old}}(\tau)\, q(\theta) \log\!\left[\frac{p(\tau, \theta)}{p_{\text{old}}(\tau)\, q(\theta)}\right] p(r = 1 \mid \tau)\, d\theta\, d\tau.$$

Since p_old(τ) was generated using the old policy, it does not depend on θ, and we can simplify the lower bound to

$$\iint p_{\text{old}}(\tau)\, q(\theta) \log\!\left[\frac{p(\tau, \theta)}{p_{\text{old}}(\tau)\, q(\theta)}\right] p(r = 1 \mid \tau)\, d\theta\, d\tau = \text{const} + \iint p_{\text{old}}(\tau)\, q(\theta) \log\!\left[\prod_{t=1}^{T} \pi(a_t \mid \theta, s_t)\right] p(r = 1 \mid \tau)\, d\theta\, d\tau,$$

where the constant term can be ignored for the maximization of this term. By assuming the factorization q(θ) = ∏_i q_i(θ_i) for the parameters and applying the Variational Bayes approach, we get the approximated distributions of the parameters:

$$\log q_j(\theta_j) = \text{const} + \frac{1}{R} \int \prod_{i \neq j} q_i(\theta_i)\, p_{\text{old}}(\tau) \log\!\left[\prod_{t=1}^{T} \pi(a_t \mid \theta, s_t)\right] p(r = 1 \mid \tau)\, d\theta_{-j}\, d\tau, \tag{5}$$

where the parameter vector θ_{-j} contains all parameters except θ_j. The normalization constant R is given by the integral

$$R = \int p_{\text{old}}(\tau)\, p(r = 1 \mid \tau)\, d\tau. \tag{6}$$

### Formulation of Group Factor Policy Search

In order to identify sets of correlated variables during policy search, we use a linear stochastic policy of a form similar to the model used in Group Factor Analysis (GFA) (Klami et al. 2015). The main idea of GFA is to introduce prior distributions for the parameters, in particular a prior for a structured transformation matrix W. The transformation matrix, responsible for mapping between a low-dimensional subspace and the original high-dimensional space, is forced to be sparse and is constructed using prior knowledge about the grouping of the dimensions; that is, W is a concatenation of transformation matrices W^(m), one for each group m. For example, if the dimensions of a vector represent the joints of a legged robot, we can group joints belonging to the same leg into the same group (see, e.g., Fig. 1).

Algorithm 1: Outline of the Group Factor Policy Search (GrouPS) algorithm.

    Input: Reward function R(·) and initializations of the parameters. Choose the number of latent dimensions n. Set the fixed hyper-parameters a^τ, b^τ, a^α, b^α, σ² and define the groupings.
    while reward not converged do
        for h = 1:H do                                   # sample H rollouts
            for t = 1:T do
                a_t = (W Z + M + E) φ(s_t, t),
                    with Z ~ N(0, I) and E^(m) ~ N(0, τ_m^{-1} I)
                Execute action a_t
            Observe and store reward R(τ)
        Initialize the q-distributions
        while not converged do
            Update q(M) with Eq. (16)
            Update q(W) with Eq. (19)
            Update q(Z) with Eq. (22)
            Update q(α) with Eq. (12)
            Update q(τ) with Eq. (25)
        M = E_{q(M)}[M];  W = E_{q(W)}[W];  α = E_{q(α)}[α];  τ = E_{q(τ)}[τ]
    Result: Linear weights M for the feature vector φ, representing the final policy. The columns of W represent the factors of the latent space.

Applying the idea of Group Factor Analysis to directed sampling leads to a linear model, i.e., a stochastic policy

$$a^{(m)}_t = \left(W^{(m)} Z_t + M^{(m)} + E^{(m)}_t\right) \phi(s_t, t), \tag{7}$$

where, for group m, the action $a^{(m)}_t \in \mathbb{R}^{D_m \times 1}$ is a linear projection of a feature vector $\phi(s_t, t) \in \mathbb{R}^{p \times 1}$. Each dimension of the feature vector is given by a basis function, which may depend on the current state and/or time. In the remainder of the paper, we will write φ instead of φ(s, t) for simplicity, even though there is an implicit dependency of φ on the current state of a trajectory. $W^{(m)} \in \mathbb{R}^{D_m \times l}$ is a transformation matrix mapping from the l-dimensional subspace to the original space. Each entry of the latent matrix $Z_t \in \mathbb{R}^{l \times p}$ is distributed according to a standard normal distribution N(0, 1), $M^{(m)} \in \mathbb{R}^{D_m \times p}$ is the mean matrix, and the entries of the noise matrix $E^{(m)}_t \in \mathbb{R}^{D_m \times p}$ are distributed by $\mathcal{N}(0, \tau^{-1}_m)$.
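The following runnable sketch draws one action from the linear model in Equation (7); the group sizes, noise precisions, and the random W and M are illustrative assumptions rather than values from the paper. Note that the latent matrix Z_t is shared by all groups, which is what couples them.

```python
# Sampling a_t^(m) = (W^(m) Z_t + M^(m) + E_t^(m)) phi(s_t, t) as in Eq. (7).
# Sizes and parameter values are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(1)
group_sizes = [6, 6]            # D_m: joints per group (e.g. left arm / right arm)
l, p = 6, 11                    # latent dimensionality and number of basis functions
tau = np.array([50.0, 50.0])    # per-group noise precisions tau_m (made up)

W = [rng.normal(size=(D, l)) for D in group_sizes]            # W^(m)
M = [rng.normal(scale=0.1, size=(D, p)) for D in group_sizes] # M^(m)
phi = rng.normal(size=p)                                      # phi(s_t, t)

Z_t = rng.normal(size=(l, p))   # latent matrix, N(0, 1) entries, shared by all groups
parts = []
for m, D in enumerate(group_sizes):
    E_t = rng.normal(scale=1.0 / np.sqrt(tau[m]), size=(D, p))   # entries ~ N(0, 1/tau_m)
    parts.append((W[m] @ Z_t + M[m] + E_t) @ phi)                # a_t^(m)
a_t = np.concatenate(parts)
print(a_t.shape)                # (12,) -- one command per controlled dimension
```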
We can derive a stochastic policy from the model defined in Equation 7. Since

$$Z_t \phi \sim \mathcal{N}\!\left(0,\; \operatorname{trace}\!\left(\phi\phi^{\mathsf T}\right) I\right) \tag{8}$$

holds (see, e.g., Luck et al. 2014), we can substitute $Z_t \phi$ by the random variable $z_t \in \mathbb{R}^{l \times 1}$, resulting in the policy

$$\pi(a_t \mid \theta, s_t) = \prod_{m=1}^{M} \mathcal{N}\!\left(a^{(m)}_t \,\middle|\, W^{(m)} z_t + M^{(m)}\phi,\; \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right) \tau^{-1}_m I\right). \tag{9}$$

If we take a closer look at the latent space given by W z_t, we first find that the length of each factor is determined by $\lVert\phi(s_t, t)\rVert^2_2$. Secondly, a factor may be non-zero only for one group or a subset of groups, as can be seen in Fig. 1. This leads to a sparse transformation matrix with specialized factors and dimensions.

As mentioned before, the form of our linear model in Equation 7 above is based on Group Factor Analysis. While GFA typically maps a vector from the latent space to the high-dimensional space, we map here a matrix from the latent space to the original space and then use this matrix as a linear policy on the feature vectors. GFA does not apply factor analysis (see, e.g., Harman 1976) on each group of variables separately, but instead introduces a sparsity prior on the transformation matrix W, thereby forcing correlations between groups:

$$p(W) = \prod_{m=1}^{M} \prod_{k=1}^{K} \prod_{d=1}^{D_m} \mathcal{N}\!\left(w^{(m)}_{d,k} \,\middle|\, 0,\; \alpha^{-1}_{m,k}\right), \tag{10}$$

with M being the number of groups, D_m the number of dimensions of the m-th group, and K the number of factors, i.e., the number of columns of W. The precision α is given by a log-linear model with

$$\log \alpha = U V^{\mathsf T} + \mu_u \mathbf{1}^{\mathsf T} + \mathbf{1} \mu_v^{\mathsf T}, \tag{11}$$

where $U \in \mathbb{R}^{M \times R}$, $V \in \mathbb{R}^{K \times R}$, and $\mu_u \in \mathbb{R}^{M}$ as well as $\mu_v \in \mathbb{R}^{K}$ model the mean profile. R defines the rank of the linear model and is chosen such that R ≤ min(M, K). However, for the special case of R = min(M, K) the precision is given by a simple gamma distribution (Klami et al. 2015),

$$q(\alpha_{m,k}) = \mathcal{G}\!\left(a^{\alpha}_m,\; b^{\alpha}_{m,k}\right), \tag{12}$$

with parameters

$$a^{\alpha}_m = a^{\alpha} + \frac{D_m}{2}, \qquad b^{\alpha}_{m,k} = b^{\alpha} + \frac{1}{2}\, \mathbb{E}_{W}\!\left[\sum_{d=1}^{D_m} \left(w^{(m)}_{d,k}\right)^{2}\right].$$

Figure 2: Graphical model in plate notation for Group Factor Policy Search. The basis functions φ(s_t, t) as well as the action vector a^(m)_t are observed. Equation 9 shows the dependencies for a^(m)_t. The latent variables z_t depend on the feature vector as stated in Equation (8). The parameter α_m might either be given by a Gamma distribution as stated in Equation (12) or by a log-linear model with dependencies on the parameters U and V. The hyper-parameters a^α and b^α are fixed and set to a small positive value.

The prior distributions given above will lead to three kinds of factors: (1) factors which are non-zero for only one group, (2) factors which are non-zero for several groups, and (3) factors which are zero. In addition to the standard GFA prior distributions above, we introduce further prior distributions for M and z, such that all prior distributions are given by

$$M \sim \mathcal{N}\!\left(M_{\text{old}},\, \sigma^2 I\right), \qquad z \sim \mathcal{N}\!\left(0,\, \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right) I\right), \qquad \alpha_{m,k} \sim \mathcal{G}\!\left(a^{\alpha},\, b^{\alpha}\right), \qquad \tau_m \sim \mathcal{G}\!\left(a^{\tau},\, b^{\tau}\right).$$

Fig. 2 shows a graphical model of Group Factor Policy Search, given by the distributions stated above. Instead of Z, the latent variable z_t is used, which depends on φ(s_t, t) given a state and a point in time.
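A minimal sketch of the sparsity mechanism behind Equations (10)-(12): every group/factor pair (m, k) has its own precision α_{m,k}, and a large precision shrinks the whole block W^(m)[:, k] towards zero, producing the pattern of Fig. 1. The shapes and hyper-parameter values below are illustrative only.

```python
# Group-wise ARD-style prior of Eqs. (10)-(12): draw a precision alpha_{m,k}
# per group/factor pair, then w^(m)_{d,k} ~ N(0, 1 / alpha_{m,k}). Blocks whose
# precision happens to be large end up close to zero, i.e. that factor is
# effectively inactive for that group. All values below are illustrative.
import numpy as np

rng = np.random.default_rng(2)
group_sizes = [6, 6]            # D_m for two hypothetical groups
K = 4                           # number of factors
a_alpha, b_alpha = 1.0, 1.0     # fixed hyper-parameters (small positive values)

W_blocks = []
for D_m in group_sizes:
    alpha_m = rng.gamma(shape=a_alpha, scale=1.0 / b_alpha, size=K)  # precisions alpha_{m,k}
    W_blocks.append(rng.normal(scale=1.0 / np.sqrt(alpha_m), size=(D_m, K)))

for m, block in enumerate(W_blocks):
    norms = np.linalg.norm(block, axis=0).round(2)
    print(f"group {m}: per-factor column norms {norms}")  # small norm = inactive factor
```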
### Derivation of Update Equations

We assume fixed hyper-parameters a^α, b^α, a^τ and b^τ for the distributions, which we determine using the Variational Inference method presented earlier, assuming a factorization of the q-distributions

$$q(\theta) = q(Z)\, q(W)\, q(\tau)\, q(M)\, q(\alpha) \tag{15}$$

and additionally the assumption $q(Z) = \prod_{t=1}^{T} q(z_t)$ with $Z_{:,t} = z_t$. By using the factorization given above and the Variational Inference rule for deriving the parameter distribution in Equation (5), we can derive the approximated parameter distributions, which maximize the expected reward. The approximated distribution for the mean matrix is given by a product of normal distributions,

$$q(M) = \prod_{m=1}^{M} \prod_{j=1}^{D_m} \mathcal{N}\!\left(m^{(m)\,\mathsf T}_{j,:} \,\middle|\, \mu^{M}_{mj},\, \Sigma^{M}_{j}\right), \tag{16}$$

where the mean and covariance parameters, in dependency of the group and dimension, are given by

$$\Sigma^{M}_{j} = \left(\sigma^{-2} I + \mathbb{E}_{p(\tau)}\!\left[\sum_{t=1}^{T} \phi\phi^{\mathsf T}\, \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} \mathbb{E}_{\bar\tau}[\tau_m]\right]\right)^{-1}, \tag{17}$$

$$\mu^{M}_{mj} = \Sigma^{M}_{j}\left(\sigma^{-2}\, m_{\text{old}\,j,:}^{\mathsf T} + \mathbb{E}_{p(\tau)}\!\left[\sum_{t=1}^{T} \phi\left(a^{(m)}_{t,j} - \mathbb{E}_{W}\!\left[w^{(m)}_{j,:}\right]\mathbb{E}_{z}[z_t]\right)\operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} \mathbb{E}_{\bar\tau}[\tau_m]\right]\right), \tag{18}$$

with $m_{j,:}$ given by the j-th row of M. The q-distribution for the transformation matrix is similarly given by

$$q(W) = \prod_{m=1}^{M} \prod_{j=1}^{D_m} \mathcal{N}\!\left(w^{(m)\,\mathsf T}_{j,:} \,\middle|\, \mu^{W}_{mj},\, \Sigma^{W}_{m}\right), \tag{19}$$

with the mean and covariance parameters

$$\Sigma^{W}_{m} = \left(\mathbb{E}_{p(\tau)}\!\left[\sum_{t=1}^{T} \mathbb{E}_{z}\!\left[z_t z_t^{\mathsf T}\right] \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} \mathbb{E}_{\bar\tau}[\tau_m]\right] + \bar{\alpha}_{m,K}\right)^{-1}, \tag{20}$$

$$\mu^{W}_{mj} = \Sigma^{W}_{m}\, \mathbb{E}_{p(\tau)}\!\left[\sum_{t=1}^{T} \left(a^{(m)}_{t,j} - \mathbb{E}_{M}\!\left[m^{(m)}_{j,:}\right]\phi\right) \mathbb{E}_{z}[z_t]\, \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} \mathbb{E}_{\bar\tau}[\tau_m]\right]. \tag{21}$$

The diagonal matrix $\bar{\alpha}_{m,K}$ is given by $\operatorname{diag}(\bar{\alpha}_{m,K}) = (\alpha_{m,1}, \alpha_{m,2}, \ldots, \alpha_{m,K})$. The distribution for the latent variables Z depends on the trajectory and time. Hence, the reward can be seen as a probabilistic weight R of a product of normal distributions. However, since we assume independent latent variables $z^h_t$, we can ignore the reward and get

$$q(Z) = \prod_{t=1}^{T} \mathcal{N}\!\left(z^{h}_t \,\middle|\, \mu^{Z}_t,\, \Sigma^{Z}_t\right), \tag{22}$$

with time-dependent parameters

$$\Sigma^{Z}_{t} = \left(\operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} I + \sum_{m=1}^{M} \mathbb{E}_{W}\!\left[W^{(m)\,\mathsf T} W^{(m)}\right] \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} \mathbb{E}_{\bar\tau}[\tau_m]\right)^{-1}, \tag{23}$$

$$\mu^{Z}_{t} = \Sigma^{Z}_{t} \sum_{m=1}^{M} \mathbb{E}_{W}\!\left[W^{(m)}\right]^{\mathsf T} \left(a^{(m)}_t - M^{(m)}\phi\right) \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} \mathbb{E}_{\bar\tau}[\tau_m]. \tag{24}$$

Unlike the other distributions, the precision is given by a product of gamma distributions,

$$q(\bar\tau) = \prod_{m=1}^{M} \mathcal{G}\!\left(\tau_m \,\middle|\, a^{\tau} + \tfrac{1}{2} D_m T,\; b^{\tau} + \tfrac{1}{2}\, b^{\tau}_m\right), \tag{25}$$

with one fixed parameter and one variable parameter. Estimation of the parameter $b^{\tau}_m$ is the most complex and computationally expensive operation, given by

$$b^{\tau}_m = \mathbb{E}_{p(\tau)}\!\Bigg[\sum_{t=1}^{T} \operatorname{Tr}\!\left(\phi\phi^{\mathsf T}\right)^{-1} \Big( a^{(m)\,\mathsf T}_t a^{(m)}_t - 2\, a^{(m)\,\mathsf T}_t \mathbb{E}_{M}\!\left[M^{(m)}\right] \phi + 2\, \mathbb{E}_{z}[z_t]^{\mathsf T}\, \mathbb{E}_{W}\!\left[W^{(m)}\right]^{\mathsf T} \mathbb{E}_{M}\!\left[M^{(m)}\right] \phi - 2\, a^{(m)\,\mathsf T}_t \mathbb{E}_{W}\!\left[W^{(m)}\right] \mathbb{E}_{z}[z_t] + \phi^{\mathsf T} \mathbb{E}_{M}\!\left[M^{(m)\,\mathsf T} M^{(m)}\right] \phi + \operatorname{Tr}\!\left(\mathbb{E}_{W}\!\left[W^{(m)\,\mathsf T} W^{(m)}\right] \operatorname{Cov}_{z}[z_t]\right) + \mathbb{E}_{z}[z_t]^{\mathsf T}\, \mathbb{E}_{W}\!\left[W^{(m)\,\mathsf T} W^{(m)}\right] \mathbb{E}_{z}[z_t] \Big)\Bigg]. \tag{26}$$

### Algorithm

Algorithm 1 summarizes all update steps for performing Group Factor Policy Search. The reward function R(·), the number n of latent dimensions, and a set of hyper-parameters need to be provided by the user.

## Evaluation

For the evaluations and experiments, the expectation $\mathbb{E}_{p(\tau)}[\cdot]$ used above in Eqs. (16-20, 25) was approximated by a sample mean,

$$\mathbb{E}_{p(\tau)}[f(\tau)] \approx \frac{1}{H} \sum_{i=1}^{H} f(\tau_i), \tag{27}$$

as proposed in (Kober and Peters 2009), where τ_i is the i-th of the H realized trajectories and f(τ) is a function value, vector, or matrix for τ_i, which is replaced by the parameter approximations given above.
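As a concrete reading of Equation (27), the sketch below replaces the expectation with a plain sample mean over rollouts; the per-rollout quantity f(τ) used here is just an illustrative matrix-valued statistic (an accumulated outer product of latent means) standing in for the expectation terms of the update equations.

```python
# Monte Carlo approximation of E_{p(tau)}[f(tau)] as in Eq. (27): average a
# per-trajectory quantity over the H realized rollouts. The latent means below
# are random placeholders rather than output of a real inner loop.
import numpy as np

rng = np.random.default_rng(3)
H, T, l = 30, 25, 6

def f(z_means):
    """f(tau): accumulate sum_t z_t z_t^T along one rollout."""
    return sum(np.outer(z, z) for z in z_means)

rollouts = [rng.normal(size=(T, l)) for _ in range(H)]    # placeholder E_z[z_t] per rollout
estimate = sum(f(z) for z in rollouts) / H                # (1/H) * sum_i f(tau_i)
print(estimate.shape)                                     # (6, 6)
```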
### Setup of the Evaluation

For the comparison between the GrouPS algorithm presented above and previous policy search algorithms, a simulated task of a bi-manual robot operating in a planar task space was used. Each of the two arms (see Fig. 3) has six degrees of freedom, and both arms share the same base for their first joint. The initial configuration of the arms is presented in Fig. 3, as well as the desired positions for each end effector (the tip of an arm).

Figure 3: Two simulated arms with six degrees of freedom and the same base in their initial position. Each end effector has a desired position for each time step, as shown by the green and red dots. The final position at time step 25 is given by the coordinate (0, 4). The numbers represent the joints, with l for left and r for right.

Figure 4: Comparison between PePPEr, PoWER, Natural Actor-Critic and three instances of the GrouPS algorithm on the presented simulated task (summed distances over iterations). Values correspond to the summed distances between each end effector and its desired position given the current policy at that iteration. The mean values as well as the standard deviations are shown.

At each of the 25 time steps we give a different goal position for each arm's end effector, starting from the left for the left arm and from the right for the right arm, with the same final position at (0, 4) for both arms. In this task, the 12 dimensions of the action vector a represent the joint angles of the two arms. For the basis functions, eleven isotropic Gaussian distributions were used, with $\phi_i(t) = \mathcal{N}(t \mid \mu^{\phi}_i, 3)$ for t ∈ {1, 2, ..., 25}. In total, 132 parameters have to be estimated, given $M \in \mathbb{R}^{12 \times 11}$.

As reference algorithms, PoWER (Kober and Peters 2009), Natural Actor-Critic (NAC) (Peters and Schaal 2008) and PePPEr (Luck et al. 2014) were chosen: NAC is a policy gradient method, while PoWER is an efficient policy search method based on expectation maximization (EM). PoWER has been experimentally validated in both simulated and physical robotic experiments (Kober and Peters 2011). PePPEr is also based on EM and combines policy search and dimensionality reduction, but without priors and thus without a structured transformation matrix. For the comparison with PePPEr and PoWER, the GrouPS algorithm was evaluated in three different configurations: (1) one group which contains all joints of both arms, (2) two groups, where each group contains the joints of one arm, and (3) four groups, with two groups per arm, where joints 1-4 form one group and joints 5-6 the second group.

The hyper-parameters of GrouPS were set to a^τ = b^τ = 1000, a^α = b^α = 1 and σ² = 100. No optimization of the hyper-parameters was performed. Furthermore, to prevent early convergence and collapsing of the distributions due to small sample sizes, the parameters W and τ are rescaled after each iteration by a factor of 1.5. The same is done after each iteration for PePPEr; however, its factor was set to 1.09, since higher values led to unstable and divergent behavior of the algorithm's parameters. PePPEr was implemented as presented in (Luck et al. 2014), and in each iteration 20 inner iterations were used for the optimization of the parameters. The same setup was used for GrouPS, and for both algorithms the number of latent dimensions was set to six. The static variance parameter for PoWER, as presented in (Kober and Peters 2009), and the initial variances of the other algorithms were all set to $10^{1.5}$; the same value was used for NAC, with its learning parameter set to 0.5.

In each iteration, we sampled 30 trajectories and evaluated them based on the reward function

$$R(\tau) = \sum_{t=1}^{T} \exp\!\left(-\left\lVert \mathrm{eff}_l(a_t) - \mathrm{pos}_l(t)\right\rVert^{2}\right) \exp\!\left(-\left\lVert \mathrm{eff}_r(a_t) - \mathrm{pos}_r(t)\right\rVert^{2}\right),$$

where the function eff_l(a_t) returns the position of the left end effector given the action vector, and pos_l(t) returns the corresponding desired goal position for time point t. eff_r(a_t) and pos_r(t) return the actual and desired positions, respectively, for the right end effector. The 15 best trajectories are then chosen for the computation of the parameters of each algorithm, as described in (Kober and Peters 2009).
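To make the setup concrete, here is a minimal sketch of the eleven Gaussian basis functions φ_i(t) = N(t | μ^φ_i, 3) and a reward of the form given above; the evenly spaced centers, the variance-3 convention, and the toy end-effector inputs are assumptions for illustration only.

```python
# Sketch of the evaluation setup: eleven isotropic Gaussian basis functions over
# the 25 time steps (variance 3, evenly spaced centers assumed) and the summed
# per-time-step reward used for the bi-manual task.
import numpy as np

T, n_basis = 25, 11
centers = np.linspace(1, T, n_basis)                        # mu_i (assumed spacing)

def phi(t: float) -> np.ndarray:
    """Feature vector of Gaussian basis functions N(t | mu_i, 3) at time t."""
    return np.exp(-0.5 * (t - centers) ** 2 / 3.0) / np.sqrt(2.0 * np.pi * 3.0)

def reward(eff_l, eff_r, pos_l, pos_r) -> float:
    """Sum over time of exp(-||eff_l - pos_l||^2) * exp(-||eff_r - pos_r||^2)."""
    d_l = np.sum((eff_l - pos_l) ** 2, axis=1)              # squared distances, shape (T,)
    d_r = np.sum((eff_r - pos_r) ** 2, axis=1)
    return float(np.sum(np.exp(-d_l) * np.exp(-d_r)))

# Toy usage with random 2-D end-effector and goal positions for all 25 steps.
rng = np.random.default_rng(4)
print(phi(12.0).shape)                                      # (11,)
print(round(reward(rng.normal(size=(T, 2)), rng.normal(size=(T, 2)),
                   rng.normal(size=(T, 2)), rng.normal(size=(T, 2))), 4))
```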
### Results

Fig. 4 depicts the results of this experiment. For each algorithm, ten different runs were executed, and both the mean and the standard deviation were computed. As can be seen in the figure, PePPEr outperforms both PoWER and NAC, as well as our method in the case where only one group spanning all variables is used. However, using two groups (one for each arm) already leads to comparable performance. Finally, the GrouPS algorithm with four different groups significantly outperforms the comparison methods.

### Importance of the Choice of Groups

In order to investigate the effect of choosing joint groups, we conducted an additional experiment. Our working hypothesis throughout the paper is that structural information about inherent groups of correlated variables will improve the search. Conversely, if we provide wrong information about the groupings, the performance of the algorithm should deteriorate. To evaluate this hypothesis, we took the original partitioning of the joints into four groups and swapped two, and later three, pairs of joints randomly. As described above, the original group partitioning is {(1l, 2l, 3l, 4l), (5l, 6l), (1r, 2r, 3r, 4r), (5r, 6r)}. Performing two random swaps between the left and right side results in {(1l, 2l, 2r, 4l), (5l, 5r), (1r, 3l, 3r, 4r), (6l, 6r)} (Fig. 6, Swap4). For three swaps the resulting partition is {(1l, 6r, 2r, 4l), (3r, 6l), (1r, 3l, 5l, 4r), (5r, 2l)} (Fig. 6, Swap5). Furthermore, three other groupings with different splitting points were evaluated: {(1l, 2l), (3l, 4l, 5l, 6l), (1r, 2r), (3r, 4r, 5r, 6r)} (Fig. 5, Swap1), {(1l, 2l), (3l, 4l), (5l, 6l), (1r, 2r), (3r, 4r), (5r, 6r)} (Fig. 5, Swap2), and {(1l, 2l, 3l), (4l, 5l, 6l), (1r, 2r, 3r), (4r, 5r, 6r)} (Fig. 5, Swap3).

Figure 5: Comparison between the originally chosen four groups and three permutations of the groups. Values correspond to the summed distance between each end effector and its desired position for each time step, given the current policy at that iteration.

Figure 6: Comparison between the original grouping and two other variants with a different splitting point. Again, the values represent the summed distances, and the shaded areas correspond to the standard deviation over ten executions.

The results of executing GrouPS with these groupings can be seen in Fig. 5 and Fig. 6. All new groupings (resulting from the above swaps) are clearly outperformed by the original partition. This result corroborates our assumption that a proper selection of groups can improve the performance of the policy search algorithm.
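For reference, the groupings compared in this section can be written down as plain index partitions; the tuple-of-joint-labels encoding below is just one illustrative way to express them, not an interface defined in the paper.

```python
# The original four-group partition and the two swap variants from Fig. 6,
# written as partitions of the twelve joint labels (1l-6l, 1r-6r).
original = [("1l", "2l", "3l", "4l"), ("5l", "6l"),
            ("1r", "2r", "3r", "4r"), ("5r", "6r")]

swap4 = [("1l", "2l", "2r", "4l"), ("5l", "5r"),     # two random left/right swaps
         ("1r", "3l", "3r", "4r"), ("6l", "6r")]

swap5 = [("1l", "6r", "2r", "4l"), ("3r", "6l"),     # three random swaps
         ("1r", "3l", "5l", "4r"), ("5r", "2l")]

# Sanity check: every variant is still a partition of the same twelve joints.
joints = {f"{i}{side}" for i in range(1, 7) for side in "lr"}
for name, grouping in [("original", original), ("swap4", swap4), ("swap5", swap5)]:
    assert set().union(*map(set, grouping)) == joints, name
print("all groupings cover the same 12 joints")
```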
## Experiment: Lifting a Leg

To test the GrouPS algorithm in an experiment that more closely follows the real world, we reproduced the experiment from (Luck et al. 2014): we simulate a NAO robot (Gouaillier et al. 2008) using the V-REP framework (Rohmer, Singh, and Freese 2013) in the task of lifting its left leg without falling. The same reward function was used as presented in (Luck et al. 2014, Eq. (22)), with parameters α = 5, β = 10, γ = 10 and λ_max = 6. The V-REP framework allows for simulations with high physical accuracy by utilizing the Bullet physics library. In this experiment, the actions represent the 26 joint velocities for each of the 15 points in time. Again, Gaussian distributions were used as feature functions, and the same parameters for GrouPS were chosen as in the evaluation above. We ran GrouPS for 100 iterations. In each iteration, we used a set of 20 samples, of which ten were randomly selected from the set of 20 of the previous iteration and ten were generated by the current policy. We used the ten best samples out of this set of 20 for computing the new policy parameters. The groups were created in such a manner that the joints of each arm and each leg form a single group, as do the joints of the head. The results are given in Fig. 7, where we find that the GrouPS algorithm is able to find a satisfactory solution even with a relatively small number of samples: the head and the left leg of the NAO robot are at high positions, corresponding to a high reward.

Figure 7: Final policy found by the GrouPS algorithm after 100 iterations. A high reward is given if the head as well as the left foot of the robot are high above the ground.

## Conclusion and Future Work

In this paper, we introduced a novel algorithm for reinforcement learning in low-dimensional latent spaces. To this end, we derived a Variational Inference framework for policy search that takes prior structural information into account. The resulting policy search algorithm can efficiently learn new policy parameters, while also uncovering the underlying latent space of solutions and incorporating prior knowledge about groups of correlated parameters. In experiments using motor skill learning tasks, we showed that the introduced GrouPS algorithm efficiently learns new motor skills. It significantly outperformed state-of-the-art policy search methods whenever prior information about structural groups was provided. So far, the dimensionality of the latent space needs to be provided as a parameter to the reinforcement learning algorithm. We plan to investigate automatic adjustment of the dimensionality based on current rewards. In this paper, we focused on intra-group correlations. In future work, we plan to investigate correlations among the extracted group factors, e.g., correlations between arms and legs.

## Acknowledgments

J. Pajarinen and V. Kyrki were supported by the Academy of Finland, decision 271394.

## References

Bernstein, N. A. 1967. The Co-ordination and Regulation of Movements. Pergamon Press.

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.

Bitzer, S.; Howard, M.; and Vijayakumar, S. 2010. Using dimensionality reduction to exploit constraints in reinforcement learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 3219-3225. IEEE.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1):1-38.

Gouaillier, D.; Hugel, V.; Blazevic, P.; Kilner, C.; Monceaux, J.; Lafourcade, P.; Marnier, B.; Serre, J.; and Maisonnier, B. 2008. The NAO humanoid: a combination of performance and affordability. arXiv preprint arXiv:0807.3223.

Harman, H. H. 1976. Modern Factor Analysis. University of Chicago Press.

Klami, A.; Virtanen, S.; Leppäaho, E.; and Kaski, S. 2015. Group factor analysis. IEEE Transactions on Neural Networks and Learning Systems 26(9):2136-2147.

Kober, J., and Peters, J. 2009. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS), 849-856.

Kober, J., and Peters, J. 2011. Policy search for motor primitives in robotics. Machine Learning 84(1):171-203.

Kolter, J. Z., and Ng, A. Y. 2007. Learning omnidirectional path following using dimensionality reduction. In Proceedings of the Robotics: Science and Systems (R:SS) Conference. The MIT Press.

Luck, K. S.; Neumann, G.; Berger, E.; Peters, J.; and Ben Amor, H. 2014. Latent space policy search for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1434-1440. IEEE.

Neumann, G. 2011. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning (ICML), 817-824.
Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180-1190.

Peters, J.; Mülling, K.; Kober, J.; Nguyen-Tuong, D.; and Krömer, O. 2011. Towards motor skill learning for robotics. In Robotics Research. Springer. 469-482.

Rohmer, E.; Singh, S. P.; and Freese, M. 2013. V-REP: A versatile and scalable robot simulation framework. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1321-1326. IEEE.

Sagduyu, Y. E., and Ephremides, A. 2004. The problem of medium access control in wireless sensor networks. IEEE Wireless Communications 11(6):44-53.

Santello, M.; Flanders, M.; and Soechting, J. 1998. Postural hand synergies for tool use. The Journal of Neuroscience 18(23).

Torres-Oviedo, G., and Ting, L. H. 2010. Subject-specific muscle synergies in human balance control are consistent across different biomechanical contexts. Journal of Neurophysiology 103(6):3084-3098.

Toussaint, M., and Storkey, A. 2006. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 945-952.

Toussaint, M. 2009. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 1049-1056. ACM.

van de Meent, J.-W.; Tolpin, D.; Paige, B.; and Wood, F. 2015. Black-box policy search with probabilistic programs. arXiv preprint arXiv:1507.04635.

Wang, X.; O'Dwyer, N.; and Halaki, M. 2013. A review on the coordinative structure of human walking and the application of principal component analysis. Neural Regeneration Research 8(7):662-670.