# Scalable Multitask Policy Gradient Reinforcement Learning

Salam El Bsat (Rafik Hariri University), Haitham Bou Ammar (American University of Beirut), Matthew E. Taylor (Washington State University)

Policy search reinforcement learning (RL) allows agents to learn autonomously with limited feedback. However, such methods typically require extensive experience for successful behavior due to their tabula rasa nature. Multitask RL is an approach that aims to reduce data requirements by allowing knowledge transfer between tasks. Although successful, current multitask learning methods suffer from scalability issues when considering a large number of tasks. The main reason behind this limitation is the reliance on centralized solutions. This paper proposes a novel distributed multitask RL framework, improving scalability across many different types of tasks. Our framework maps multitask RL to an instance of general consensus and develops an efficient decentralized solver. We justify the correctness of the algorithm both theoretically and empirically: we first prove an improvement of the convergence speed to the order of $O(1/k)$, with $k$ being the number of iterations, and then show our algorithm surpassing others on multiple dynamical system benchmarks.

## Introduction

Reinforcement learning (RL) allows agents to solve sequential decision-making problems with limited feedback. Applications with these characteristics range from robotics control (Kober and Peters 2009) to personalized medicine (Murphy et al. 2007; Pineau et al. 2007). Though successful, typical RL methods require substantial experience before acquiring acceptable behavior. The cost of obtaining such experience can be prohibitively expensive in terms of time and data. Transfer learning (Taylor and Stone 2009) and multitask learning (Lazaric and Ghavamzadeh 2010; Li, Liao, and Carin 2009) have been developed to remedy these problems by allowing agents to reuse knowledge across tasks. Unfortunately, both of these techniques suffer from scalability problems as the number of tasks grows large. Among the different methods proposed for handling scalability in supervised learning, two directions stand out. First, data (i.e., trajectories or tasks) are streamed sequentially online, leading to regret minimization games where the learner plays against an adversary. Second, decentralized solvers allow multiple processing units to share work. For example, a distributed version of multitask support vector machines can significantly outperform centralized solvers (Forero, Cano, and Giannakis 2010).

Centralized online solvers have been well studied in the multitask learning setting (Ammar, Tutunov, and Eaton 2015; Bou Ammar et al. 2015; Gheshlaghi Azar, Lazaric, and Brunskill 2013). For example, Bou Ammar et al. proposed PG-ELLA, an online multitask policy search algorithm. PG-ELLA decomposes a task's policy parameters into a shared latent repository, $L$, and task-specific coefficients, $s_t$, one per task. It allows for knowledge transfer between tasks (using $L$) while streaming problems online. The main drawback of PG-ELLA (and similar methods, e.g., (Kumar and Daumé III 2012), (Bou Ammar et al. 2015), and (Bou Ammar et al. 2012)) is apparent when considering a large number of tasks or dimensions: determining the shared knowledge base $L$ becomes intractable due to two inefficiencies.
The first inefficiency is that computing the expansion's operating point amounts to solving a local RL problem described by the current task's observed trajectories. These RL optima must be computed online at each round (i.e., time step of the interaction between the agent and adversary) of PG-ELLA. This problem is remedied in the original paper by taking gradient steps in the local MDP's objective function, while reducing the number of trajectories used. Unfortunately, such a solution has multiple drawbacks: crucial to the success of PG-ELLA are good-enough local policy parameters to be encoded by the shared repository, yet following a policy gradient step is not guaranteed to provide informative local parameters due to the $O(1/\sqrt{k})$ convergence rate of gradient-based methods (the fastest rate known so far for such techniques (Wei and Ozdaglar 2012)), the reduction in the number of trajectories used, and the local-minima problems inherent in such techniques.

The second inefficiency reducing PG-ELLA's scalability arises when updating the shared repository $L$. The update involves an inversion of a $dp \times dp$ matrix, with $d$ being the number of features and $p$ the number of latent dimensions. This leads to a complexity of $O(d^3 p^2)$ at each iteration of the algorithm. Consequently, such a method is not scalable when the policy parameterization or the latent dimension $p$ becomes high dimensional. Moreover, as the number of tasks grows large, these centralized methods (used in updating $L$ or computing the local task optima) become intractable due to computational and memory constraints.

This paper introduces the first framework for addressing the multitask RL problem in a distributed and scalable manner. Our method maps multitask policy search to an instance of general consensus. We justify the correctness of our method both theoretically and empirically. First, we show linear convergence speeds of the order of $O(1/k)$, with $k$ being the iteration count. Second, we demonstrate that our algorithm can surpass state-of-the-art techniques on a variety of benchmark dynamical systems. This paper's contributions can be summarized as: i) developing the first scalable multitask policy search reinforcement learner, ii) formally describing the relation between RL and general consensus optimization, iii) developing a distributed multidimensional alternating direction method of multipliers (ADMM) solver for multitask RL, iv) proving linear convergence in the multidimensional setting, and v) empirically outperforming state-of-the-art techniques on five benchmark dynamical systems.

## Reinforcement Learning

Reinforcement learning (RL) is a technique used to solve sequential decision-making problems with limited feedback. RL typically models such problems as a Markov decision process (MDP) $\langle \mathcal{X}, \mathcal{U}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the (potentially infinite) state space, $\mathcal{U} \subseteq \mathbb{R}^c$ is the action space, $\mathcal{P}: \mathcal{X} \times \mathcal{U} \times \mathcal{X} \rightarrow [0, 1]$ is the transition function describing the environment's dynamics, $\mathcal{R}: \mathcal{X} \times \mathcal{U} \times \mathcal{X} \rightarrow \mathbb{R}$ is the reward function quantifying the agent's behavior, and $\gamma \in (0, 1]$ is the discount factor. The interaction of an RL agent with its environment proceeds as follows. At each time step $h$, the agent resides in a state $x_h \in \mathcal{X}$. Upon applying an action $u_h \in \mathcal{U}$, the agent transitions to a new state $x_{h+1} \in \mathcal{X}$ according to $x_{h+1} \sim \mathcal{P}(x_{h+1} \mid x_h, u_h)$ and receives a reward $r_{h+1} = \mathcal{R}(x_h, u_h, x_{h+1})$ quantifying this transition. Repeating the aforementioned process for a horizon of length $H$, the agent accumulates a sequence of state-action pairs, which we refer to as a trajectory, denoted by $\tau = [x_{1:H}, u_{1:H}]$.
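To make this interaction protocol concrete, the following minimal Python sketch collects a single trajectory; the `env` object (with `reset`/`step` methods) and the `policy` callable are illustrative assumptions, not part of the paper.

```python
import numpy as np

def collect_trajectory(env, policy, horizon):
    """Roll out `policy` in `env` for `horizon` steps and return
    tau = [x_{1:H}, u_{1:H}] together with the received rewards."""
    states, actions, rewards = [], [], []
    x = env.reset()                      # initial state x_1 ~ P_0
    for h in range(horizon):
        u = policy(x)                    # u_h ~ pi(. | x_h)
        x_next, r = env.step(u)          # x_{h+1} ~ P(. | x_h, u_h), r_{h+1} = R(x_h, u_h, x_{h+1})
        states.append(x)
        actions.append(u)
        rewards.append(r)
        x = x_next
    return np.array(states), np.array(actions), np.array(rewards)
```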
The agent's goal is then to find an action-selection rule (i.e., a policy $\pi$) that maximizes the total accumulated reward over multiple interactions with the environment.

## Policy Search Reinforcement Learning

Policy search RL has shown success in high-dimensional control problems (e.g., robotics control (Kober and Peters 2011)). In policy search, $\pi_{\theta}$ is the policy, parameterized by a vector of parameters $\theta \in \mathbb{R}^d$. The agent's goal is to determine $\theta$ maximizing:

$$
\mathcal{J}(\theta) = \mathbb{E}_{p_{\theta}(\tau)}\left[\mathcal{R}(\tau)\right] = \int_{\tau} p_{\theta}(\tau)\,\mathcal{R}(\tau)\,\mathrm{d}\tau, \quad (1)
$$

where $p_{\theta}(\tau)$ and $\mathcal{R}(\tau)$ are, respectively, the probability of acquiring a trajectory $\tau$ and its total accumulated reward, with

$$
p_{\theta}(\tau) = \mathcal{P}_0(x_0)\,\prod_{h=1}^{H} \mathcal{P}(x_{h+1} \mid x_h, u_h)\,\pi_{\theta}(u_h \mid x_h),
$$

and $\mathcal{P}_0: \mathcal{X} \rightarrow [0, 1]$ being the initial-state distribution. Most policy gradient algorithms maximize a lower bound of Equation 1 by generating trajectories using the current policy ($\pi_{\theta}$) and then comparing the result with a new policy parameterized by $\tilde{\theta}$. As detailed elsewhere (Kober and Peters 2011), the expected return can be lower-bounded using Jensen's inequality and the concavity of the logarithm:

$$
\log \mathcal{J}(\tilde{\theta}) = \log \int_{\tau} p_{\tilde{\theta}}(\tau)\,\mathcal{R}(\tau)\,\mathrm{d}\tau \;\geq\; \int_{\tau} p_{\theta}(\tau)\,\mathcal{R}(\tau)\,\log\frac{p_{\tilde{\theta}}(\tau)}{p_{\theta}(\tau)}\,\mathrm{d}\tau + \text{const} \;\propto\; -\mathrm{D}_{\mathrm{KL}}\!\left(p_{\theta}(\tau)\,\mathcal{R}(\tau)\,\big\|\,p_{\tilde{\theta}}(\tau)\right) = \mathcal{J}_{L,\tilde{\theta}}\!\left(p_{\theta}(\tau)\right), \quad (2)
$$

where $\mathrm{D}_{\mathrm{KL}}(p(\tau)\,\|\,q(\tau)) = \int_{\tau} p(\tau) \log \frac{p(\tau)}{q(\tau)}\,\mathrm{d}\tau$. We see that this is equivalent to minimizing the KL divergence between the reward-weighted trajectory distribution of $\pi_{\theta}$ and the trajectory distribution $p_{\tilde{\theta}}$ of the new policy $\pi_{\tilde{\theta}}$.

## Multi-Task Policy Search

Multitask policy search (MTPS) allows knowledge sharing and transfer across a group of domains. In MTPS, the agent is faced with a set of $T$ tasks, each being an MDP denoted by $\langle \mathcal{X}_t, \mathcal{U}_t, \mathcal{P}_t, \mathcal{R}_t, \gamma_t \rangle$. The goal of the agent is to learn a set of policies $\Pi = \{\pi_1, \ldots, \pi_T\}$, one for each task. Following policy gradients, we parameterize the policy for a task $t$ by a vector $\theta_t \in \mathbb{R}^d$, which must be found by maximizing the total expected return $\mathcal{J}_t(\theta_t) = \int_{\tau_t} p_{\theta_t}(\tau_t)\,\mathcal{R}_t(\tau_t)\,\mathrm{d}\tau_t$. Consequently, when considering all tasks from $t = 1$ to $t = T$, the MTPS optimization problem is:

$$
\min_{\theta_1, \ldots, \theta_T} \sum_{t=1}^{T} -\mathcal{J}_t(\theta_t) + \mathrm{Reg}(\theta_1, \theta_2, \ldots, \theta_T),
$$

where $\mathrm{Reg}(\theta_1, \theta_2, \ldots, \theta_T)$ is a regularization function that ensures improvement over single-task learning by imposing shared structure between task policies. To allow for knowledge transfer between tasks, we assume a factored model represents the policy parameters, similar to the settings introduced elsewhere (Ammar, Tutunov, and Eaton 2015; Kumar and Daumé III 2012). Namely, we assume $\theta_t = L s_t$. Here, $L \in \mathbb{R}^{d \times k}$ is a matrix used to represent $k$ latent knowledge components, shared across all tasks. Furthermore, the task-specific coefficient vectors $s_t \in \mathbb{R}^{k}$ specialize this knowledge to a task $t$ by filtering over the shared repository $L$. Given the above factored model, we rewrite the optimization problem of MTPS as:

$$
\min_{L,\, s_{1}, \ldots, s_{T}} \sum_{t=1}^{T} \left[ -\mathcal{J}_t(L s_t) + \mu_1 \|s_t\|_1 \right] + \mu_2 \|L\|_{\mathrm{F}}^2,
$$

with $\|A\|_{\mathrm{F}}$ denoting the Frobenius norm of a matrix $A$, and $\|\cdot\|_1$ being the L1 norm used to induce sparsity. The remaining ingredient needed to finalize the definition of MTPS is the usage of the lower bound derived in Equation 2 in place of $\mathcal{J}_t(\cdot)$, leading to:

$$
\min_{L,\, s_{1}, \ldots, s_{T}} \sum_{t=1}^{T} \left[ -\mathbb{E}_{p_{\theta_t}(\tau_t)}\!\left[\mathcal{R}_t(\tau_t)\,\log p_{L s_t}(\tau_t)\right] + \mu_1 \|s_t\|_1 \right] + \mu_2 \|L\|_{\mathrm{F}}^2, \quad (3)
$$

where we wrote the optimization variable as $\theta_t = L s_t$ to introduce transfer through $L$, and

$$
p_{L s_t}(\tau_t) = \mathcal{P}_{0,t}(x_{0,t}) \prod_{h=1}^{H_t} \mathcal{P}_t(x_{h+1,t} \mid x_{h,t}, u_{h,t})\, \pi_{L s_t}(u_{h,t} \mid x_{h,t}), \quad (4)
$$

$$
\mathcal{R}_t(\tau_t) = \sum_{h=1}^{H_t} \mathcal{R}_t(x_{h,t}, u_{h,t}, x_{h+1,t}). \quad (5)
$$
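As a small numerical illustration of the factored model $\theta_t = L s_t$ (the dimensions and random data below are placeholders chosen for the example, not values from the paper), a shared basis combined with sparse task codes yields per-task policy parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 6, 3, 4                 # policy dimension, latent components, number of tasks

L = rng.standard_normal((d, k))   # shared latent repository, one column per knowledge component
S = np.zeros((k, T))              # task-specific coefficients s_1, ..., s_T (columns)
for t in range(T):
    active = rng.choice(k, size=1, replace=False)   # sparsity induced by the L1 penalty mu_1 ||s_t||_1
    S[active, t] = rng.standard_normal(active.size)

Theta = L @ S                     # column t is theta_t = L s_t, the policy parameters of task t
print(Theta.shape)                # (d, T)
```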
## Multi-Action RL & Gaussian Policies

Recent MTPS work typically focuses on systems exhibiting multidimensional state spaces and one-dimensional actions. We generalize this notion to the multidimensional action case by considering a matrix feedback controller. To do so, we assume a matrix $\Phi_{h,t}(x_{h,t}) \in \mathbb{R}^{c \times d}$ encodes multidimensional features of the state space for a task $t \in \{1, \ldots, T\}$ at time step $h \in \{1, \ldots, H_t\}$. Hence, a $c$-dimensional control vector $u_{h,t}$ can be written as $u_{h,t} = \Phi_{h,t}(x_{h,t})\,\theta_t + \epsilon_{h,t}$, where $\epsilon_{h,t} \in \mathbb{R}^c$ is a $c$-dimensional Gaussian noise term used for exploration. Consequently, the stochastic multidimensional policy is:

$$
\pi_{\theta_t}(u_{h,t} \mid x_{h,t}) = \mathcal{N}\!\left(\Phi_{h,t}(x_{h,t})\,\theta_t,\; \Sigma^{-1}_{h,t}\right), \quad (6)
$$

where $\Sigma^{-1}_{h,t} \in \mathbb{R}^{c \times c}$ is a time-dependent covariance matrix. Substituting Equations 6 and 4 into 3 finalizes the MTPS problem:

$$
\min_{L,\, s_{1}, \ldots, s_{T}} \sum_{t=1}^{T} \left[ \mathbb{E}_{p_{\theta_t}(\tau_t)}\!\left[\mathcal{R}_t(\tau_t) \sum_{h=1}^{H_t} \left\| u_{h,t} - \Phi_{h,t}(x_{h,t})\, L s_t \right\|^2_{\Sigma^{-1}_{h,t}}\right] + \mu_1 \|s_t\|_1 \right] + \mu_2 \|L\|_{\mathrm{F}}^2. \quad (7)
$$

Equation 7 can also be explained intuitively. In the expectation term, the agent is solving an extended (over the horizon of the trajectory) least-squares problem with trajectory rewards acting as weights. In other words, a sample point (a trajectory in our setting) is weighted heavily if its attained reward is high, and its significance is reduced if the reward is low. Hence, when a good trajectory is observed, the agent tries to replicate the chosen actions by minimizing the error norm between its model (i.e., $\Phi_{h,t} L s_t$) and the real action $u_{h,t}$. These norms are then summed over the horizon and over a number of exploratory trajectories (i.e., the expectation) to ensure successful learning. Finally, the outer summation (i.e., over $t$) ensures a good-enough repository over all tasks $t \in \{1, \ldots, T\}$.

Drawbacks of Current Solvers: Solving the problem in Equation 7 appears difficult due to the non-convexity introduced by the product of $L$ and $s_t$. For such bilinear products, standard solutions from general optimization recognize that Equation 7 is convex in each of the variables when the others are fixed, and consider a two-step alternating optimization approach. In the first step, the repository is determined while fixing $s_1, \ldots, s_T$; in the second step, the task coefficients are computed given an updated $L$. Though successful, such methods typically assume a centralized approach. In the supervised learning setting, two techniques help scalability. In the first technique, data is streamed sequentially online, leading to a regret-minimization game against an adversary. The field of online multitask RL is typically referred to as lifelong RL, and techniques such as PG-ELLA (Bou-Ammar et al. 2014; Bou Ammar et al. 2015) have been developed to handle this problem. PG-ELLA (and others) still assume central computations and adopt slow first-order methods for computing latent models. Furthermore, due to the usage of approximations to the original loss function (e.g., second-order Taylor expansions), these methods can become inaccurate and restrictively slow, especially when the task distribution varies widely between domains. The second technique relies on distributed optimization, where multiple processing units are considered. Such a direction has not yet been well explored in the multitask RL setting. Next, we present the first distributed solver for MTPS and demonstrate that our method improves the convergence speed of current techniques used for scalability (e.g., PG-ELLA) to $O(1/k)$, with $k$ being the iteration count.
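For concreteness, here is a minimal sketch of sampling actions from the Gaussian policy in Equation 6 and of evaluating the reward-weighted least-squares term of Equation 7 on one sampled trajectory; the feature map, dimensions, and covariance below are illustrative assumptions rather than the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, c, k = 5, 2, 3                              # state/feature dim, action dim, latent dim

def features(x):
    """Placeholder feature matrix Phi(x) in R^{c x d} (illustrative choice)."""
    return np.tile(x, (c, 1))

L = rng.standard_normal((d, k))
s_t = rng.standard_normal(k)
theta_t = L @ s_t                              # theta_t = L s_t
Sigma = 0.1 * np.eye(c)                        # exploration covariance of the Gaussian policy
Sigma_inv = np.linalg.inv(Sigma)

def sample_action(x):
    """Draw u ~ N(Phi(x) theta_t, Sigma), cf. Eq. 6."""
    return rng.multivariate_normal(features(x) @ theta_t, Sigma)

def weighted_ls_term(states, actions, traj_return):
    """Reward-weighted squared-error term of Eq. 7 for one trajectory."""
    loss = 0.0
    for x, u in zip(states, actions):
        resid = u - features(x) @ theta_t
        loss += resid @ Sigma_inv @ resid
    return traj_return * loss
```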
## Scalable Multitask Policy Search

We assume the processing units are connected by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with node set $\mathcal{V}$ and edge set $\mathcal{E}$. We assume the $n$ nodes are connected via $m$ edges, and we impose no additional constraints on the topology of $\mathcal{G}$. We further assume a natural ordering of the nodes from $1, \ldots, |\mathcal{V}|$ and use $e_{ij} \in \mathcal{E}$ to denote the edge between nodes $i$ and $j$ with $i < j$. We define a block matrix $A$ with $|\mathcal{E}|$ block rows and $|\mathcal{V}|$ block columns. Suppose edge $e_{ij}$ is the $k$th edge in $\mathcal{E}$, represented by the $k$th block row of $A$. Then block entry $(k, i)$ of $A$ equals the identity matrix $I$ (i.e., $A(k, i) = I$), block entry $(k, j)$ equals $-I$ (i.e., $A(k, j) = -I$), and all other entries are set to the zero matrix. $A(I)$ is the generalized edge-node incidence matrix with identity block $I$.

Each node $i \in \mathcal{V}$ is randomly assigned a set of tasks. Given a set of trajectories for all $T$ tasks, the first step is to randomly assign chunks of data to each processor. The next step is to devise a procedure to determine the latent model based on partial information. The model we consider exhibits a product of two terms: 1) the shared repository $L$, and 2) the task-specific coefficients $s_t$ for each task. Due to the non-convexity of the product, our technique adopts alternating optimization. Within each alternating step, computing the task projections can be performed fully distributed as long as each node is equipped with a Lasso solver. Updating the repository $L$, on the other hand, requires more care due to its unifying nature across all tasks. Our method handles the repository by letting each node compute a separate update based on its assigned tasks. This can be achieved if we map the MTPS problem to distributed general consensus. To do so, we start by introducing $\mathrm{vec}(L)$, a column-wise unrolled version of $L$: the operator $\mathrm{vec}(L)$ vectorizes the $d \times k$ repository into a vector of size $dk$ with the columns of $L$ concatenated consecutively. Using $\mathrm{vec}(L)$, it can be shown that the parametric model $\Phi_{h,t}(x_{h,t})\,L s_t$ can be written as:

$$
\mathrm{vec}\!\left(\Phi_{h,t}(x_{h,t})\,L s_t\right) = \left(s_t^{\mathsf{T}} \otimes \Phi_{h,t}(x_{h,t})\right) \mathrm{vec}(L),
$$

with $\otimes$ denoting the Kronecker product. Having updated the task coefficients, we can rewrite the optimization problem in Equation 7 in terms of $\mathrm{vec}(L)$ as:

$$
\min_{\mathrm{vec}(L)} \sum_{t=1}^{T} \left[ \mathbb{E}_{p_{\theta_t}(\tau_t)}\!\left[\mathcal{R}_t(\tau_t) \sum_{h=1}^{H_t} \left\| u_{h,t} - \left(s_t^{\mathsf{T}} \otimes \Phi_{h,t}(x_{h,t})\right) \mathrm{vec}(L) \right\|^2_{\Sigma^{-1}_{h,t}}\right] + \mu_1 \|s_t\|_1 \right] + \mu_2 \|\mathrm{vec}(L)\|_2^2.
$$

Having split the data across $n$ processing units, we can rewrite MTPS in an equivalent distributed form as:

$$
\min_{\mathrm{vec}(L_1), \ldots, \mathrm{vec}(L_n)} \sum_{i=1}^{n} \left[ \mathcal{J}_i\!\left(\mathrm{vec}(L_i)\right) + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \right] \quad \text{s.t.} \quad \mathrm{vec}(L_1) = \mathrm{vec}(L_2) = \cdots = \mathrm{vec}(L_n), \quad (8)
$$

where $\mathcal{J}_i(\mathrm{vec}(L_i))$ denotes the per-node loss function defined as:

$$
\mathcal{J}_i\!\left(\mathrm{vec}(L_i)\right) = \sum_{t \in \mathbb{T}_i} \left( \mathbb{E}_{p_{\theta_t}(\tau_t)}\!\left[\mathcal{R}_t(\tau_t) \sum_{h=1}^{H_t} \left\| u_{h,t} - \left(s_t^{\mathsf{T}} \otimes \Phi_{h,t}(x_{h,t})\right) \mathrm{vec}(L_i) \right\|^2_{\Sigma^{-1}_{h,t}}\right] + \mu_1 \|s_t\|_1 \right),
$$

where $\mathbb{T}_i$ represents all tasks assigned to node $i \in \mathcal{V}$, such that $|\mathbb{T}_1| + |\mathbb{T}_2| + \cdots + |\mathbb{T}_n| = T$. The distributed form is equivalent to the MTPS problem derived in Equation 7. The crucial difference, however, is that in distributed MTPS, each node $i \in \mathcal{V}$ updates a version of the shared repository based on only partial knowledge of the total set of tasks. These repositories are then unified using the consensus constraint (i.e., $\mathrm{vec}(L_1) = \cdots = \mathrm{vec}(L_n)$) to generate a single common knowledge base $\mathrm{vec}(L) \in \mathbb{R}^{dk}$.
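Two ingredients of this consensus mapping can be checked numerically: the vectorization identity $\mathrm{vec}(\Phi L s_t) = (s_t^{\mathsf{T}} \otimes \Phi)\,\mathrm{vec}(L)$ and the block edge-node incidence matrix $A(I)$, for which $A\,\mathrm{vec}(\bar{L}) = 0$ expresses agreement between the nodes of a connected graph. The Python sketch below uses made-up dimensions and a toy three-node graph purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, c = 4, 3, 2                              # repository dims and action dim (made up)

Phi = rng.standard_normal((c, d))
L = rng.standard_normal((d, k))
s_t = rng.standard_normal(k)

vec_L = L.flatten(order="F")                   # column-wise unrolling vec(L) in R^{dk}
lhs = Phi @ L @ s_t                            # Phi L s_t
rhs = np.kron(s_t, Phi) @ vec_L                # (s_t^T kron Phi) vec(L)
assert np.allclose(lhs, rhs)

# Block edge-node incidence matrix A(I) for a toy graph with nodes {1, 2, 3}
# and edges e_12, e_13, e_23 (identity blocks of size d*k).
edges = [(0, 1), (0, 2), (1, 2)]
n, blk = 3, d * k
A = np.zeros((len(edges) * blk, n * blk))
I = np.eye(blk)
for row, (i, j) in enumerate(edges):
    A[row * blk:(row + 1) * blk, i * blk:(i + 1) * blk] = I
    A[row * blk:(row + 1) * blk, j * blk:(j + 1) * blk] = -I

# A @ stacked_repositories == 0 exactly when all nodes agree on vec(L).
stacked = np.tile(vec_L, n)
assert np.allclose(A @ stacked, 0.0)
```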
Solving Distributed Multitask Policy Search: At this stage, any off-the-shelf distributed optimization package could be used for determining $L$. Distributed first-order methods (e.g., distributed sub-gradients), however, exhibit slow convergence speeds, leading to problems similar to those of PG-ELLA (Bou-Ammar et al. 2014). Our goal is an MTPS algorithm with linear convergence.

The alternating direction method of multipliers (ADMM) is an optimization technique that exhibits fast convergence speeds. ADMM solves constrained problems by relying on dual methods: it decomposes the original problem into two subproblems, which are then solved while updating dual variables. In our setting, a decentralized ADMM solver is needed. Wei and Ozdaglar (2012) have already proposed a distributed version of ADMM and showed linear convergence of the form $O(1/k)$, with $k$ being the total number of iterations. However, this technique is not readily applicable to our setting for two reasons. First, the method focuses on the univariate (one-dimensional) setting. Second, the convergence proofs are rather restrictive. We therefore 1) generalize distributed ADMM to the multidimensional case, handling high-dimensional shared repositories, and 2) modify the convergence proofs to better suit RL.

The strategy we follow in deriving our algorithm relies on decomposition techniques using a Lagrangian augmented with a penalty term. To derive the corresponding form, we first introduce the generalized edge-node incidence matrix $\mathbf{A} = A(I_{dk \times dk}) \in \mathbb{R}^{dkm \times dkn}$, where $I_{dk \times dk}$ is a $dk$-dimensional identity matrix. Notice that $\mathbf{A}$ is just the original edge-node incidence matrix $A$, but extended to account for all of the repository's dimensions. Further, we introduce the vector $\mathrm{vec}(\bar{L}) = [\mathrm{vec}(L_1)^{\mathsf{T}}, \ldots, \mathrm{vec}(L_n)^{\mathsf{T}}]^{\mathsf{T}} \in \mathbb{R}^{dkn}$ concatenating the repositories of all nodes of $\mathcal{G}$. We can now rewrite Equation 8 in a more compact form as:

$$
\min_{\mathrm{vec}(L_1), \ldots, \mathrm{vec}(L_n)} \sum_{i=1}^{n} \left[ \mathcal{J}_i\!\left(\mathrm{vec}(L_i)\right) + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \right] \quad \text{s.t.} \quad \mathbf{A}\,\mathrm{vec}(\bar{L}) = \mathbf{0}_{dkm}.
$$

To solve the above constrained optimization problem, we introduce a vector of Lagrange multipliers $\lambda \in \mathbb{R}^{dkm}$ and write the augmented Lagrangian as:

$$
\mathcal{L}_{\mathrm{Aug}}\!\left(\mathrm{vec}(\bar{L}), \lambda\right) = \sum_{i=1}^{n} \left[ \mathcal{J}_i\!\left(\mathrm{vec}(L_i)\right) + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 \right] - \lambda^{\mathsf{T}} \mathbf{A}\,\mathrm{vec}(\bar{L}) + \frac{\rho}{2}\left\|\mathbf{A}\,\mathrm{vec}(\bar{L})\right\|_2^2,
$$

where $\rho > 0$ is the parameter of the penalty term. The augmented Lagrangian is the standard Lagrangian, typically used in constrained optimization, with an additional penalty term of the form $\|\mathbf{A}\,\mathrm{vec}(\bar{L})\|_2^2$ used to ensure stability (e.g., see (Wei and Ozdaglar 2012; Boyd and Vandenberghe 2004)).

To derive the shared repository updates, we follow the standard strategy proposed in the original ADMM, with crucial changes needed for the distributed setting. In contrast to (Wei and Ozdaglar 2012), we assume that each processor $i$ keeps a multidimensional local decision estimate $\mathrm{vec}(L_i)$ and a vector of dual variables $\lambda_{li}$ with $l < i$. We define two sets of neighbors for a node $i$, denoted by $\mathcal{P}_i$ and $\mathcal{S}_i$. Here, $\mathcal{P}_i$ collects all neighbors having an index lower than $i$, i.e., $\mathcal{P}_i = \{j \mid e_{ij} \in \mathcal{E},\, j < i\}$, and $\mathcal{S}_i$ all neighbors with an index higher than $i$, i.e., $\mathcal{S}_i = \{j \mid e_{ij} \in \mathcal{E},\, i < j\}$. Our method determines $\mathrm{vec}(\bar{L})$ using the instructions in Algorithm 1, shown below.

Algorithm 1: Scalable Multitask Policy Search
1: Initialize $\mathrm{vec}(L_i)$, $i \in \{1, \ldots, n\}$, and set $\rho > 0$.
2: for $k = 0, \ldots, K$ do
3: Each agent $i$ updates $\mathrm{vec}(L_i)$ in sequential order, $i = 1, \ldots, |\mathcal{V}|$, using:
$$
\mathrm{vec}(L_i)^{(k+1)} = \arg\min_{\mathrm{vec}(L_i)} \; \mathcal{J}_i\!\left(\mathrm{vec}(L_i)\right) + \mu_2 \|\mathrm{vec}(L_i)\|_2^2 + \sum_{l \in \mathcal{P}_i} \lambda_{li}^{(k)\mathsf{T}} \mathrm{vec}(L_i) - \sum_{j \in \mathcal{S}_i} \lambda_{ij}^{(k)\mathsf{T}} \mathrm{vec}(L_i) + \frac{\rho}{2} \sum_{l \in \mathcal{P}_i} \left\| \mathrm{vec}(L_l)^{(k+1)} - \mathrm{vec}(L_i) \right\|_2^2 + \frac{\rho}{2} \sum_{j \in \mathcal{S}_i} \left\| \mathrm{vec}(L_i) - \mathrm{vec}(L_j)^{(k)} \right\|_2^2
$$
4: Each agent updates $\lambda_{li}$ for $l \in \mathcal{P}_i$ as:
$$
\lambda_{li}^{(k+1)} = \lambda_{li}^{(k)} - \rho \left( \mathrm{vec}(L_l)^{(k+1)} - \mathrm{vec}(L_i)^{(k+1)} \right).
$$

Algorithm 1 consists of two steps. First, primal updates are performed by each node in $\mathcal{G}$ (line 3). The primal updates are similar to those of standard ADMM, with the crucial difference that they are performed completely locally thanks to the introduction of $\mathcal{P}_i$ and $\mathcal{S}_i$. Second, given the primal variables, the dual variables are computed as shown in line 4.
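To make the primal/dual pattern of Algorithm 1 concrete, here is a minimal scalar sketch of the same sequential consensus-ADMM updates on a toy graph, using simple quadratic local losses in place of $\mathcal{J}_i$; the graph, losses, and $\rho$ are made-up illustrations rather than the paper's setup.

```python
import numpy as np

# Toy instance: each node i holds a scalar local loss f_i(z) = 0.5 * a_i * (z - b_i)^2.
# Subject to consensus over the graph, the optimum is the weighted mean of the b_i.
rng = np.random.default_rng(3)
n = 5
a = rng.uniform(0.5, 2.0, size=n)
b = rng.uniform(-1.0, 1.0, size=n)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)]      # connected toy graph, l < j per edge
P = {i: [l for (l, j) in edges if j == i] for i in range(n)}  # lower-index neighbors of i
S = {i: [j for (l, j) in edges if l == i] for i in range(n)}  # higher-index neighbors of i

rho = 1.0
z = np.zeros(n)
lam = {e: 0.0 for e in edges}                 # one dual variable per edge e_{lj}

for k in range(300):
    # Primal sweep (cf. Algorithm 1, line 3): nodes update sequentially, i = 1, ..., n.
    for i in range(n):
        num = a[i] * b[i]
        num -= sum(lam[(l, i)] for l in P[i])
        num += sum(lam[(i, j)] for j in S[i])
        num += rho * (sum(z[l] for l in P[i]) + sum(z[j] for j in S[i]))
        z[i] = num / (a[i] + rho * (len(P[i]) + len(S[i])))
    # Dual updates (cf. Algorithm 1, line 4): one per edge, enforcing z_l - z_j = 0.
    for (l, j) in edges:
        lam[(l, j)] -= rho * (z[l] - z[j])

print("consensus spread:", z.max() - z.min())
print("ADMM estimate:", z.mean(), "  weighted mean:", np.dot(a, b) / a.sum())
```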
Note that in the case of multidimensional Gaussian policies (the derivation generalizes to any policy in the exponential family), the primal updates on line 3 of Algorithm 1 can be computed in closed form (the full derivation can be found in the appendix) after approximating the expectation by the sample mean over $M_t$ trajectories for all tasks $t \in \{1, \ldots, T\}$:

$$
\mathrm{vec}(L_i)^{(k+1)} = \left[ \sum_{t \in \mathbb{T}_i} \frac{1}{M_t} \sum_{m=1}^{M_t} \mathcal{R}_t\!\left(\tau_t^{(m)}\right) \sum_{h=1}^{H_t^{(m)}-1} \left(s_t^{\mathsf{T}} \otimes \Phi_{h,t}^{(m)}\!\left(x_{h,t}^{(m)}\right)\right)^{\mathsf{T}} \Sigma^{-1}_{h,t} \left(s_t^{\mathsf{T}} \otimes \Phi_{h,t}^{(m)}\!\left(x_{h,t}^{(m)}\right)\right) + \left(\mu_2 + \frac{\rho\,\mathrm{deg}_i}{2}\right) I \right]^{-1} \left[ \frac{1}{2}\sum_{l \in \mathcal{P}_i} \left(\rho\,\mathrm{vec}(L_l)^{(k+1)} - \lambda_{li}^{(k)}\right) + \frac{1}{2}\sum_{j \in \mathcal{S}_i} \left(\rho\,\mathrm{vec}(L_j)^{(k)} + \lambda_{ij}^{(k)}\right) + \sum_{t \in \mathbb{T}_i} \frac{1}{M_t} \sum_{m=1}^{M_t} \mathcal{R}_t\!\left(\tau_t^{(m)}\right) \sum_{h=1}^{H_t^{(m)}-1} \left(s_t^{\mathsf{T}} \otimes \Phi_{h,t}^{(m)}\!\left(x_{h,t}^{(m)}\right)\right)^{\mathsf{T}} \Sigma^{-1}_{h,t}\, u_{h,t}^{(m)} \right],
$$

where $\mathrm{deg}_i$ is the degree of processing unit $i$.

Theoretical Guarantees: We now present the theorem establishing the improvement over other techniques, e.g., PG-ELLA with its sublinear gradient-based rate, to the linear $O(1/k)$ rate of our proposed method. Before detailing our theoretical result, however, we note the following standard assumptions (Boyd and Vandenberghe 2004; Wei and Ozdaglar 2012):

Assumption 1 (Saddle Point & Penalty Boundedness). The augmented Lagrangian has a saddle point; that is, there exists a solution $(\mathrm{vec}(\bar{L})^{\star}, \lambda^{\star})$ such that:

$$
\mathcal{L}_{\mathrm{Aug}}\!\left(\mathrm{vec}(\bar{L})^{\star}, \lambda\right) \;\leq\; \mathcal{L}_{\mathrm{Aug}}\!\left(\mathrm{vec}(\bar{L})^{\star}, \lambda^{\star}\right) \;\leq\; \mathcal{L}_{\mathrm{Aug}}\!\left(\mathrm{vec}(\bar{L}), \lambda^{\star}\right) \quad \text{for all } \mathrm{vec}(\bar{L}),\, \lambda.
$$

We further assume that the penalty term is bounded, i.e., $\|\mathbf{A}\,\mathrm{vec}(\bar{L})\|_2^2 \leq \gamma$ for some $\gamma > 0$.

Assuming the above, the appendix proves:

Theorem 1. The sequence $\{\mathrm{vec}(\bar{L})^{(k)}, \lambda^{(k)}\}_{k \geq 0}$ converges linearly, with a rate given by $O(1/k)$.

## Experiments & Results

We empirically validate our method on five existing benchmark dynamical systems (Bou Ammar et al. 2015).

The cart pole (CP) system is controlled by the cart's mass $m_c$ in kg, the pole's mass $m_p$ in kg, and the pole's length $l$ in meters. The state is given by the cart's position $x$ and velocity $v$, as well as the pole's angle $\theta$ and angular velocity $\dot{\theta}$. The goal is to control the pole in an upright position.

The double inverted pendulum (DIP) is an extension of the cart pole system. It has one cart of mass $m_0$ in kg and two poles with lengths $l_1$ and $l_2$ in meters. We assume the poles are massless and that two masses $m_1$ and $m_2$ in kg sit on top of each pole. The state consists of the cart's position $x_1$ and velocity $v_1$, the lower pole's angle $\theta_1$ and angular velocity $\dot{\theta}_1$, as well as the upper pole's angle $\theta_2$ and angular velocity $\dot{\theta}_2$. The goal is again to learn a policy that drives the two poles to a specific state.

A linearized model of a helicopter (HC) assumes constant horizontal motion, characterized by two matrices $A \in \mathbb{R}^{4 \times 4}$ and $B \in \mathbb{R}^{4 \times 2}$. The main goal is to stabilize the helicopter by controlling the collective and differential rotor thrust.

The simple mass (SM) system is characterized by the spring constant $k$ in N/m, the damping constant $d$ in Ns/m, and the mass $m$ in kg. The system's state is given by the position $x$ and the velocity $v$ of the mass. The goal is to train a policy for guiding the mass to a specific state.

The double mass (DM) system is an extension of the simple mass system with two masses $m_1, m_2$ (in kg), two springs (with spring constants $k_1$ and $k_2$ in N/m), and two damping constants ($d_1$ and $d_2$ in Ns/m). The state consists of the big mass's position $x_1$ and velocity $v_1$, as well as the small mass's position $x_2$ and velocity $v_2$. The goal is again to learn a policy that drives the two masses to a specific state.

We generated 150 tasks for each domain by varying its dynamical parameters (750 tasks in total). The cost was given by $\|x_m - x_{\mathrm{ref}}\|_2^2$, where $x_{\mathrm{ref}}$ was the goal state.
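As an illustration of how such a task distribution could be produced, the sketch below simulates the simple mass (SM) spring-damper system and generates task variants by sampling its parameters; the parameter ranges, time step, and Euler integration are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def simple_mass_step(state, u, k, d, m, dt=0.01):
    """One Euler step of the spring-damper system m*x'' = -k*x - d*v + u."""
    x, v = state
    a = (-k * x - d * v + u) / m
    return np.array([x + dt * v, v + dt * a])

def sample_sm_task(rng):
    """Draw one SM task by varying the dynamical parameters (illustrative ranges)."""
    return dict(k=rng.uniform(1.0, 10.0), d=rng.uniform(0.1, 1.0), m=rng.uniform(0.5, 5.0))

rng = np.random.default_rng(4)
tasks = [sample_sm_task(rng) for _ in range(150)]   # 150 tasks per domain, as in the experiments

x_ref = np.array([1.0, 0.0])                        # goal state
state = np.zeros(2)
cost = 0.0
for h in range(150):                                # one illustrative rollout with zero control
    state = simple_mass_step(state, u=0.0, **tasks[0])
    cost += np.sum((state - x_ref) ** 2)            # cost ||x - x_ref||_2^2 accumulated over the rollout
print(len(tasks), cost)
```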
We ran each task for a total of 200 iterations. At each iteration, the learner observed a task through 50 trajectories of 150 steps and performed its algorithmic updates. We used eNAC (Peters and Schaal 2008), a standard PG algorithm, as the base learner.

Figure 1: Figures (a)-(e) report average cost versus iterations and demonstrate that Dist-MTLPS is capable of outperforming other methods. Figure (f) shows the agreement (consensus) error across processors on three sample systems (cart pole, double mass, simple mass). Figures (g) and (h) show that Dist-MTLPS also often outperforms the competition in jumpstart and asymptotic performance.

We compare our method (Dist-MTLPS) to standard PG, PG-ELLA (an online variant of multitask learning introduced to handle scalability to a large number of tasks), and GO-MTL (Kumar and Daumé III 2012). For a fair comparison against PG-ELLA, we approximate the original loss of an RL problem by a second-order Taylor expansion around a local optimum computed on a given set of trajectories. Our method therefore solves the same problem as PG-ELLA. This comparison with PG-ELLA allows us to understand whether our method is capable of outperforming state-of-the-art techniques that adapt regret minimization games for scaling up multitask reinforcement learning.

We show in Figures 1(a)-1(e) that our method can outperform all comparison methods in terms of data complexity. One could worry that these improvements are only possible because of increased computation. However, Figure 2(a) shows that our technique actually achieves these data improvements while also improving wall-clock running times.

To distribute our computations, we made use of MATLAB's parallel pool running on 10 nodes. For Dist-MTLPS, we assigned the 150 tasks evenly across the 10 agents. The edges between agents were generated randomly to increase the likelihood of expander graphs, which provide improved practical properties (e.g., low condition numbers, leading to increased computational stability). To make sure every node in the graph has at least one predecessor and successor, we connect every node $i$ to node $i + 1$ (wrapping around to node 1 for node 10). We then randomly assigned 20 additional edges to the graph.

Figures 1(a)-1(e) report the average cost on the different benchmarks as a function of iterations. In all systems, our method outperforms the others in total cost incurred. As we consider constrained optimization, it is crucial for Dist-MTLPS to acquire a solution that satisfies the constraints.
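One simple way to quantify constraint satisfaction is to track an agreement (consensus) error across the processors' local repositories; a minimal sketch, assuming the per-node estimates are kept as a list of vectors:

```python
import numpy as np

def consensus_error(local_vecs):
    """Maximum deviation of the nodes' local repositories vec(L_i) from their average;
    zero exactly when the consensus constraint vec(L_1) = ... = vec(L_n) holds."""
    V = np.stack(local_vecs)              # shape (n, d*k)
    mean = V.mean(axis=0)
    return np.max(np.linalg.norm(V - mean, axis=1))

# Toy usage with made-up repositories for 3 nodes.
rng = np.random.default_rng(5)
vec_L = rng.standard_normal(12)
print(consensus_error([vec_L + 1e-3 * rng.standard_normal(12) for _ in range(3)]))
```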
Figure 2: (a) shows that Dist-MTLPS takes less time (in seconds) to optimize the multitask learning objective function, and (b) shows that our method can generalize to novel tasks.

Figure 1(f) shows that our method is capable of acquiring feasible solutions by reporting the consensus error on three example systems: our method not only acquires good repositories, but is also capable of converging to an agreement between the processing units. Interestingly, when interpreted differently, this result can pave the way for considering safe MTPS (which we leave as a direction for future work). Figures 1(g) and 1(h) demonstrate that our algorithm is capable of outperforming PG-ELLA and GO-MTL in terms of jumpstart and asymptotic performance. Figure 2(a) demonstrates that our method takes less wall-clock time to optimize the objective function. As tasks become more complex, the advantage of using our distributed method increases. For instance, Dist-MTLPS can solve a problem with 50 HC tasks in about 9 seconds, compared to roughly 200 seconds for PG-ELLA and over 4000 seconds for GO-MTL.

Multitask learning provides significant advantages when the agent faces a novel task domain. To evaluate this, we chose the most complex of the task domains (HC) and trained each multitask learner on 149 tasks to yield an effective shared knowledge base for each of the algorithms. Then, we evaluated the ability to learn a new task from the helicopter domain, comparing the benefits of transfer for Dist-MTLPS (from $L_{\text{Dist-MTLPS}}$), PG-ELLA (from $L_{\text{PG-ELLA}}$), GO-MTL (from $L_{\text{GO-MTL}}$), and standard PG. Figure 2(b) depicts the result of learning on a novel domain, averaged over HC tasks, showing that Dist-MTLPS converges fastest in this scenario.

## Conclusions and Future Work

This paper introduced the first distributed, scalable multitask policy search framework. We showed that our method achieves linear convergence speeds of order $O(1/k)$, a significant improvement over gradient-based methods. We further assessed our technique empirically and demonstrated its superiority over state-of-the-art methods on a variety of benchmark dynamical systems. Future work will include targeting the more general cross-domain multitask learning setting and running our method on real-world multi-robot applications.

## Acknowledgements

The authors would like to thank Yusen Zhan for his help with the experimental results and for the interesting discussions. Further, the authors would like to thank the anonymous reviewers who helped better shape this paper.

## References

Ammar, H. B.; Tutunov, R.; and Eaton, E. 2015. Safe policy search for lifelong reinforcement learning with sublinear regret. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2361–2369.

Bou-Ammar, H.; Tuyls, K.; Taylor, M. E.; Driessens, K.; and Weiss, G. 2012. Reinforcement learning transfer via sparse coding. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Bou-Ammar, H.; Eaton, E.; Ruvolo, P.; and Taylor, M. 2014. Online multi-task learning for policy gradient methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). JMLR Workshop and Conference Proceedings.

Bou Ammar, H.; Eaton, E.; Luna, J. M.; and Ruvolo, P. 2015. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning.
In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-15).

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. New York, NY, USA: Cambridge University Press.

Forero, P. A.; Cano, A.; and Giannakis, G. B. 2010. Consensus-based distributed support vector machines. Journal of Machine Learning Research 11:1663–1707.

Gheshlaghi Azar, M.; Lazaric, A.; and Brunskill, E. 2013. Sequential transfer in multi-armed bandit with finite set of models. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc.

Kober, J., and Peters, J. 2009. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, 849–856.

Kober, J., and Peters, J. 2011. Policy search for motor primitives in robotics. Machine Learning 84(1-2).

Kumar, A., and Daumé III, H. 2012. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 1383–1390.

Lazaric, A., and Ghavamzadeh, M. 2010. Bayesian multitask reinforcement learning. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-2010).

Li, H.; Liao, X.; and Carin, L. 2009. Multi-task reinforcement learning in partially observable stochastic environments. Journal of Machine Learning Research 10:1131–1186.

Murphy, S. A.; Oslin, D. W.; Rush, A. J.; and Zhu, J. 2007. Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology 32(2):257–262.

Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71.

Pineau, J.; Bellemare, M. G.; Rush, A. J.; Ghizaru, A.; and Murphy, S. A. 2007. Constructing evidence-based treatment strategies using methods from computer science. Drug and Alcohol Dependence 88:S52–S60.

Taylor, M. E., and Stone, P. 2009. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10:1633–1685.

Wei, E., and Ozdaglar, A. 2012. Distributed alternating direction method of multipliers. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), 5445–5450.