# Lifelong Learning with a Changing Action Set

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Yash Chandak,¹ Georgios Theocharous,² Chris Nota,¹ Philip S. Thomas¹
¹University of Massachusetts Amherst, ²Adobe Research
{ychandak, cnota, pthomas}@cs.umass.edu, theochar@adobe.com

In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well studied in the lifelong learning literature, the setting where the size of the action set changes remains unaddressed. In this paper, we present first steps towards developing an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown structure in the space of actions, and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.

## Introduction

Real-world problems are often non-stationary. That is, parts of the problem specification change over time. We desire autonomous systems that continually adapt by capturing the regularities in such changes, without the need to learn from scratch after every change. In this work, we address one form of lifelong learning for sequential decision making problems, wherein the set of possible actions (decisions) varies over time. Such a situation is omnipresent in real-world problems. For example, in robotics it is natural to add control components over the lifetime of a robot to enhance its ability to interact with the environment. In hierarchical reinforcement learning, an agent can create new options (Sutton, Precup, and Singh 1999) over its lifetime, which are in essence new actions. In medical decision support systems for drug prescription, new procedures and medications are continually discovered. In product recommender systems, new products are constantly added to the stock, and in tutorial recommendation systems, new tutorials are regularly developed, thereby continuously increasing the number of available actions for a recommender engine. These examples capture the broad idea that, for an agent deployed in a real-world setting, the possible decisions it can make change over time, and they motivate the question that we aim to answer: how do we develop algorithms that can continually adapt to such changes in the action set over the agent's lifetime?

Reinforcement learning (RL) has emerged as a successful class of methods for solving sequential decision making problems. However, excluding notable exceptions that we discuss later (Boutilier et al. 2018; Mandel et al. 2017), its applications have been limited to settings where the set of actions is fixed. This is likely because RL algorithms are designed to solve a mathematical formalization of decision problems called Markov decision processes (MDPs) (Puterman 2014), wherein the set of available actions is fixed. To begin addressing our lifelong learning problem, we first extend the standard MDP formulation to incorporate this aspect of a changing action set size.
Motivated by the regularities in real-world problems, we consider an underlying, unknown structure in the space of actions from which new actions are generated. We then theoretically analyze the difference between what an algorithm can achieve with only the actions that are available at one point in time, and the best that the algorithm could achieve if it had access to the entire underlying space of actions (and knew the structure of this space). Leveraging insights from this theoretical analysis, we then study how the structure of the underlying action space can be recovered from interactions with the environment, and how algorithms can be developed to use this structure to facilitate lifelong learning.

As in the standard RL setting, when facing a changing action set, the parameterization of the policy plays an important role. The key consideration here is how to parameterize the policy and adapt its parameters when the set of available actions changes. To address this problem, we leverage the structure in the underlying action space to parameterize the policy such that it is invariant to the cardinality of the action set: changing the number of available actions does not require changes to the number of parameters or the structure of the policy. Leveraging the structure of the underlying action space also improves generalization by allowing the agent to infer the outcomes of actions similar to actions already taken. These advantages make our approach ideal for lifelong learning problems where the action set changes over time, and where quick adaptation to these changes, via generalization of prior knowledge about the impact of actions, is beneficial.

## Related Works

Lifelong learning is a well-studied problem (Thrun 1998; Ruvolo and Eaton 2013; Silver, Yang, and Li 2013; Chen and Liu 2016). Predominantly, prior methods aim to address catastrophic forgetting in order to leverage prior knowledge for new tasks (French 1999; Kirkpatrick et al. 2017; Lopez-Paz and others 2017; Zenke, Poole, and Ganguli 2017). Several meta-reinforcement-learning methods aim at addressing the problems of transfer learning, few-shot adaptation to new tasks after training on a distribution of similar tasks, and automated hyper-parameter tuning (Xu, van Hasselt, and Silver 2018; Gupta et al. 2018; Wang et al. 2017; Duan et al. 2016; Finn, Abbeel, and Levine 2017). Alternatively, many lifelong RL methods consider learning online in the presence of continuously changing transition dynamics or reward functions (Neu 2013; Gajane, Ortner, and Auer 2018). In our work, we look at a complementary aspect of the lifelong learning problem, wherein the size of the action set available to the agent changes over its lifetime.

Our work also draws inspiration from recent works which leverage action embeddings (Dulac-Arnold et al. 2015; He et al. 2015; Bajpai, Garg, and others 2018; Chandak et al. 2019; Tennenholtz and Mannor 2019). Building upon their ideas, we present a new objective for learning structure in the action space, and show that the performance of the policy resulting from using this inferred structure has bounded sub-optimality. Moreover, in contrast to their setup, where the size of the action set is fixed, we consider the lifelong MDP case, where the number of actions changes over time.

Mandel et al. (2017) and Boutilier et al. (2018) present the work most similar to ours. Mandel et al. (2017) consider the setting where humans can provide new actions to an RL system.
The goal in their setup is to minimize human effort by querying for new actions only at states where new actions are most likely to boost performance. In comparison, our setup considers the case where new actions become available through some external, unknown process, and the goal is to build learning algorithms that can efficiently adapt to such changes in the action set. Boutilier et al. (2018) laid the foundation for the stochastic action set MDP (SAS-MDP) setting, where there is a fixed, finite number of (base) actions and the available set of actions is a stochastically chosen subset of this base set. While SAS-MDPs can also be considered to have a "changing action set", unlike their work there is no fixed maximum number of available actions in our framework. Further, in their setup it is possible that within a single long episode an agent observes all possible actions it will ever encounter; in our setup, this is never possible. As shown by Boutilier et al. (2018), SAS-MDPs can also be reduced to standard MDPs by extending the state space to include the set of available actions. This cannot be done in our lifelong-MDP setup, as that would imply that the state space changes across episodes or that the MDP is non-stationary. The works by Gabel and Riedmiller (2008) and Ferreira et al. (2017) also consider subsets of the base actions, for DEC-MDPs and answer-set programming respectively, but all the mentioned differences from the work by Boutilier et al. (2018) are applicable here as well. These differences let the proposed work better capture the challenges of lifelong learning, where the cardinality of the action set itself varies over time and an agent has to deal with actions that it has never dealt with before.

Figure 1: Illustration of a lifelong MDP where $\mathcal{M}_0$ is the base MDP. For every change $k$, $\mathcal{M}_k$ builds upon $\mathcal{M}_{k-1}$ by including the newly available set of actions $\mathcal{A}_k$. The internal structure in the space of actions is hidden and only a set of discrete actions is observed.

## Lifelong Markov Decision Process

MDPs, the standard formalization of decision making problems, are not flexible enough to encompass lifelong learning problems wherein the action set size changes over time. In this section we extend the standard MDP framework to model this setting.

In real-world problems where the set of possible actions changes, there is often underlying structure in the set of all possible actions (those that are available, and those that may become available). For example, tutorial videos can be described by feature vectors that encode their topic, difficulty, length, and other attributes; in robot control tasks, primitive locomotion actions like left, right, up, and down could be encoded by their change to the Cartesian coordinates of the robot, etc. Critically, we will not assume that the agent knows this structure, merely that it exists. If actions are viewed from this perspective, then the set of all possible actions (those that are available at one point in time, and those that might become available at any time in the future) can be viewed as a vector space, $\mathcal{E} \subseteq \mathbb{R}^d$.

To formalize the lifelong MDP, we first introduce the necessary variables that govern when and how new actions are added. We denote the episode number by $\tau$. Let $I_\tau \in \{0, 1\}$ be a random variable that indicates whether a new set of actions is added at the start of episode $\tau$, and let the frequency $\mathcal{F} : \mathbb{N} \to [0, 1]$ be the associated probability distribution over episode count, such that $\Pr(I_\tau = 1) = \mathcal{F}(\tau)$.
Let $U_\tau \in 2^{\mathcal{E}}$ be the random variable corresponding to the set of actions that is added before the start of episode $\tau$. When $I_\tau = 1$, we assume that $U_\tau \neq \emptyset$, and when $I_\tau = 0$, we assume that $U_\tau = \emptyset$. Let $\mathcal{D}_\tau$ be the distribution of $U_\tau$ when $I_\tau = 1$, i.e., $U_\tau \sim \mathcal{D}_\tau$ if $I_\tau = 1$. We use $\mathcal{D}$ to denote the set $\{\mathcal{D}_\tau\}$ consisting of these distributions. Such a formulation using $I_\tau$ and $\mathcal{D}_\tau$ provides fine control over when and how new actions can be incorporated. This allows modeling a large class of problems where both the distribution over the type of incorporated actions and the intervals between successive changes may be irregular. Often we will not require the exact episode number $\tau$, but instead require $k$, which denotes the number of times the action set has changed.

Since we do not assume that the agent knows the structure associated with the actions, we instead provide actions to the agent as a set of discrete entities, $\mathcal{A}_k$. To this end, we define $\phi$ to be a map relating the underlying structure of the new actions to the observed set of discrete actions $\mathcal{A}_k$ for all $k$, i.e., if the set of actions added is $u_k$, then $\mathcal{A}_k = \{\phi(e_i) \mid e_i \in u_k\}$. Naturally, for most problems of interest, neither the underlying structure $\mathcal{E}$, nor the set of distributions $\mathcal{D}$, nor the frequency of updates $\mathcal{F}$, nor the relation $\phi$ is known; the agent only has access to the observed set of discrete actions.

We now define the lifelong Markov decision process (L-MDP) as $\mathcal{L} = (\mathcal{M}_0, \mathcal{E}, \mathcal{D}, \mathcal{F})$, which extends a base MDP $\mathcal{M}_0 = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, d_0)$. $\mathcal{S}$ is the set of all possible states that the agent can be in, called the state set. $\mathcal{A}$ is the discrete set of actions available to the agent, and for $\mathcal{M}_0$ we define this set to be empty, i.e., $\mathcal{A} = \emptyset$. When the set of available actions changes and the agent observes a new set of discrete actions, $\mathcal{A}_k$, then $\mathcal{M}_{k-1}$ transitions to $\mathcal{M}_k$, such that $\mathcal{A}$ in $\mathcal{M}_k$ is the set union of $\mathcal{A}$ in $\mathcal{M}_{k-1}$ and $\mathcal{A}_k$. Apart from the available actions, other aspects of the L-MDP remain the same throughout. An illustration of the framework is provided in Figure 1.

We use $S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, and $R_t \in \mathbb{R}$ as random variables denoting the state, action, and reward at time $t \in \{0, 1, \dots\}$ within each episode. The first state, $S_0$, comes from an initial distribution $d_0$, and the reward function $\mathcal{R}$ is defined to depend only on the state, such that $\mathcal{R}(s) = \mathbb{E}[R_t \mid S_t = s]$ for all $s \in \mathcal{S}$. We assume that $R_t \in [-R_{\max}, R_{\max}]$ for some finite $R_{\max}$. The reward discounting parameter is given by $\gamma \in [0, 1)$. $\mathcal{P}$ is the state transition function, such that for all $s, a, s', t$, the function $\mathcal{P}(s, a, s')$ denotes the transition probability $P(s' \mid s, e)$, where $a = \phi(e)$.¹

¹ For notational ease, (a) we overload the symbol $P$ to represent both probability mass and density; (b) we assume that the state set is finite; however, our primary results extend to MDPs with continuous states.

In the most general case, new actions could be completely arbitrary and have no relation to the ones seen before. In such cases, there is very little hope of lifelong learning by leveraging past experience. To make the problem more feasible, we resort to a notion of smoothness between actions. Formally, we assume that transition probabilities in an L-MDP are $\rho$-Lipschitz in the structure of actions, i.e., there exists a $\rho > 0$ such that

$$\forall s, s', e_i, e_j \qquad \big\lVert P(s' \mid s, e_i) - P(s' \mid s, e_j) \big\rVert_1 \le \rho \,\lVert e_i - e_j \rVert_1.$$

For any given MDP $\mathcal{M}_k$ in $\mathcal{L}$, an agent's goal is to find a policy, $\pi_k$, that maximizes the expected sum of discounted future rewards. For any policy $\pi_k$, the corresponding state value function is $v^{\pi_k}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s, \pi_k\right]$.
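To make the formalization concrete, here is a minimal, illustrative Python sketch (ours, not the authors' code; the class name, the constant Bernoulli change rate, and the Gaussian action-generating distribution are all assumptions) of how an L-MDP-style environment might expose a growing discrete action set while keeping the underlying structure $\mathcal{E}$ hidden from the agent:

```python
import numpy as np

class LifelongActionSet:
    """Illustrative manager for the observed action set A_k in an L-MDP.
    The hidden embeddings play the role of E; the agent only sees action indices."""

    def __init__(self, d=4, change_prob=0.05, batch_size=50, seed=0):
        self.rng = np.random.default_rng(seed)
        self.d = d                      # dimension of the hidden structure E (subset of R^d)
        self.change_prob = change_prob  # stand-in for F(tau): constant Bernoulli rate
        self.batch_size = batch_size    # size of each newly added action set
        self._embeddings = []           # hidden e_i in E; never exposed to the agent
        self.num_changes = 0            # k: number of times the action set has changed

    def maybe_add_actions(self):
        """Call at the start of episode tau. Returns True if the action set changed."""
        if self.rng.random() < self.change_prob:                      # I_tau = 1
            new_e = self.rng.normal(size=(self.batch_size, self.d))   # U_tau ~ D_tau (assumed Gaussian)
            self._embeddings.extend(new_e)
            self.num_changes += 1
            return True
        return False                                                  # I_tau = 0, so U_tau is empty

    @property
    def actions(self):
        """Observed discrete action set: plain indices, i.e., phi(e_i) with the structure hidden."""
        return list(range(len(self._embeddings)))

    def underlying_embedding(self, action_id):
        """Environment-side lookup used by the (unknown) dynamics P(s' | s, e)."""
        return self._embeddings[action_id]
```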
## Blessing of Changing Action Sets

Finding an optimal policy when the set of possible actions is large is difficult due to the curse of dimensionality. In the L-MDP setting this problem might appear to be exacerbated, as an agent must additionally adapt to the changing levels of possible performance as new actions become available. This raises the natural question: as new actions become available, how much does the performance of an optimal policy change? If it fluctuates significantly, can a lifelong learning agent succeed by continuously adapting its policy, or is it better to learn from scratch with every change to the action set?

To answer this question, consider an optimal policy, $\pi_k^*$, for MDP $\mathcal{M}_k$, i.e., an optimal policy when considering only policies that use actions that are available during the $k$th episode. We now quantify how sub-optimal $\pi_k^*$ is relative to the performance of a hypothetical policy, $\mu^*$, that acts optimally given access to all possible actions.

**Theorem 1.** *In an L-MDP, let $\epsilon_k$ denote the maximum distance in the underlying structure of the closest pair of available actions, i.e., $\epsilon_k := \sup_{a_i \in \mathcal{A}} \inf_{a_j \in \mathcal{A}} \lVert e_i - e_j \rVert_1$; then*

$$v^{\mu^*}(s_0) - v^{\pi_k^*}(s_0) \le \frac{\gamma \rho \epsilon_k}{(1 - \gamma)^2} R_{\max}.$$

*Proof.* See Appendix B.

With a bound on the maximum possible sub-optimality, Theorem 1 presents an important connection between achievable performances, the nature of the underlying structure in the action space, and a property of the available actions in any given $\mathcal{M}_k$. Using this, we can make the following conclusion.

**Corollary 1.** *Let $\mathcal{Y} \subseteq \mathcal{E}$ be the smallest closed set such that $P(U_k \in 2^{\mathcal{Y}}) = 1$. We refer to $\mathcal{Y}$ as the element-wise support of $U_k$. If, for all $k$, the element-wise support of $U_k$ in an L-MDP is $\mathcal{E}$, then the sub-optimality vanishes as $k \to \infty$. That is,*

$$\lim_{k \to \infty} \; v^{\mu^*}(s_0) - v^{\pi_k^*}(s_0) \to 0.$$

*Proof.* See Appendix B.

Through Corollary 1, we can now establish that the change in optimal performance will eventually converge to zero as new actions are repeatedly added. An intuitive way to observe this result is to notice that every new action that becomes available indirectly provides more information about the underlying, unknown structure of $\mathcal{E}$. However, in the limit, as the size of the available action set increases, the information provided by each new action vanishes and thus performance saturates.

Certainly, in practice we can never have $k \to \infty$, but this result is still advantageous. Even when the underlying structure $\mathcal{E}$, the set of distributions $\mathcal{D}$, the change frequency $\mathcal{F}$, and the mapping relation $\phi$ are all unknown, it establishes that the difference between the best performances across successive changes will remain bounded and will not fluctuate arbitrarily. This opens up new possibilities for developing algorithms that do not need to start from scratch after new actions are added, but rather can build upon their past experiences using updates to their existing policies that efficiently leverage estimates of the structure of $\mathcal{E}$ to adapt to new actions.
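As a rough numerical illustration of Theorem 1 (ours, not from the paper): taking $\gamma = 0.9$, $\rho = 1$, and $R_{\max} = 1$, the bound reads

$$v^{\mu^*}(s_0) - v^{\pi_k^*}(s_0) \;\le\; \frac{0.9 \times 1 \times \epsilon_k}{(1 - 0.9)^2} \times 1 \;=\; 90\,\epsilon_k,$$

so the worst-case sub-optimality shrinks linearly as newly added actions reduce $\epsilon_k$, while the $(1 - \gamma)^{-2}$ factor means the guarantee is loosest for long-horizon (large $\gamma$) problems.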
## Learning with Changing Action Sets

Theorem 1 characterizes what can be achieved in principle; however, it does not specify how to achieve it, that is, how to find $\pi_k^*$ efficiently. Using any parameterized policy, $\pi$, which acts directly in the space of observed actions suffers from one key practical drawback in the L-MDP setting: the parameterization is deeply coupled with the number of actions that are available. That is, not only is the meaning of each parameter coupled with the number of actions, but often the number of parameters that the policy has is dependent on the number of possible actions. This makes it unclear how the policy should be adapted when additional actions become available. A trivial solution would be to ignore the newly available actions and continue using only the previously available actions. However, this is clearly myopic, and will prevent the agent from achieving the better long-term returns that might be possible using the new actions.

To address this parameterization problem, instead of having the policy, $\pi$, act directly in the observed action space, $\mathcal{A}$, we propose an approach wherein the agent reasons about the underlying structure of the problem in a way that makes its policy parameterization invariant to the number of actions that are available. To do so, we split the policy parameterization into two components. The first component corresponds to the state-conditional policy responsible for making the decisions, $\beta : \mathcal{S} \times \hat{\mathcal{E}} \to [0, 1]$, where $\hat{\mathcal{E}} \subseteq \mathbb{R}^d$. The second component corresponds to $\hat{\phi} : \hat{\mathcal{E}} \times \mathcal{A} \to [0, 1]$, an estimator of the relation $\phi$, which is used to map the output of $\beta$ to an action in the set of available actions. That is, an $E_t \in \hat{\mathcal{E}}$ is sampled from $\beta(S_t, \cdot)$ and then $\hat{\phi}(E_t)$ is used to obtain the action $A_t$. Together, $\beta$ and $\hat{\phi}$ form a complete policy, and $\hat{\mathcal{E}}$ corresponds to the inferred structure in action space.

One of the prime benefits of estimating $\phi$ with $\hat{\phi}$ is that it makes the parameterization of $\beta$ invariant to the cardinality of the action set: changing the number of available actions does not require changing the number of parameters of $\beta$. Instead, only the parameterization of $\hat{\phi}$, the estimator of the underlying structure in action space, must be modified when new actions become available. We show next that the update to the parameters of $\hat{\phi}$ can be performed using supervised learning methods that are independent of the reward signal and thus typically more efficient than RL methods.

While our proposed parameterization of the policy using both $\beta$ and $\hat{\phi}$ has the advantages described above, the performance of $\beta$ is now constrained by the quality of $\hat{\phi}$, as in the end $\hat{\phi}$ is responsible for selecting an action from $\mathcal{A}$. Ideally, we want $\hat{\phi}$ to be such that it lets $\beta$ be both (a) invariant to the cardinality of the action set, for practical reasons, and (b) as expressive as a policy, $\pi$, explicitly parameterized for the currently available actions. Similar trade-offs have been considered in the context of learning optimal state embeddings for representing sub-goals in hierarchical RL (Nachum et al. 2018). For our lifelong learning setting, we build upon their method to efficiently estimate $\hat{\phi}$ in a way that provides bounded sub-optimality.

Specifically, we make use of an additional inverse dynamics function, $\varphi$, that takes as input two states, $s$ and $s'$, and produces as output a prediction of which $e \in \mathcal{E}$ caused the transition from $s$ to $s'$. Since the agent does not know $\phi$, when it observes a transition from $s$ to $s'$ via action $a$, it does not know which $e$ caused this transition. So, we cannot train $\varphi$ to make good predictions using the actual action, $e$, that caused the transition. Instead, we use $\hat{\phi}$ to transform the prediction of $\varphi$ from $e \in \mathcal{E}$ to $a \in \mathcal{A}$, and train both $\varphi$ and $\hat{\phi}$ so that this process accurately predicts which action, $a$, caused the transition from $s$ to $s'$.
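Before turning to how $\hat{\phi}$ and $\varphi$ are trained, the following minimal PyTorch-style sketch (ours, not the authors' implementation; the Gaussian form of $\beta$, the distance-based softmax for $\hat{\phi}$, and all names are assumptions) illustrates the two-component policy, where only $\hat{\phi}$'s embedding table grows when new actions arrive:

```python
import torch
import torch.nn as nn

class DecisionComponent(nn.Module):
    """beta(s, .): maps a state to a Gaussian over the inferred structure E_hat.
    Its parameter count is independent of the number of available actions."""
    def __init__(self, state_dim, embed_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, embed_dim)
        self.log_std = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

class ActionMapper(nn.Module):
    """phi_hat(. | e_hat): scores one learned embedding per available action against e_hat.
    Only this table needs new rows (parameters) when new actions become available."""
    def __init__(self, embed_dim, num_actions):
        super().__init__()
        self.embeddings = nn.Parameter(0.01 * torch.randn(num_actions, embed_dim))

    def add_actions(self, num_new):
        # Grow the embedding table; any optimizer over these parameters must be rebuilt.
        new_rows = 0.01 * torch.randn(num_new, self.embeddings.shape[1])
        self.embeddings = nn.Parameter(torch.cat([self.embeddings.data, new_rows]))

    def forward(self, e_hat):
        # Closer action embeddings get higher probability (distance-based softmax).
        logits = -torch.cdist(e_hat.unsqueeze(0), self.embeddings).squeeze(0)
        return torch.distributions.Categorical(logits=logits)

def act(beta, phi_hat, state):
    """Overall policy: sample e_hat ~ beta(s, .), then a ~ phi_hat(. | e_hat)."""
    e_hat = beta(state).rsample()
    return phi_hat(e_hat).sample()
```

In this sketch, the output dimension of the decision component is tied only to the dimension of the inferred structure, so adding actions touches nothing except the rows of the action mapper.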
Moreover, rather than viewing $\varphi$ as a deterministic function mapping states $s$ and $s'$ to predictions $e$, we define $\varphi$ to be a distribution over $\mathcal{E}$ given two states, $s$ and $s'$. For any given $\mathcal{M}_k$ in an L-MDP $\mathcal{L}$, let $\beta_k$ and $\hat{\phi}_k$ denote the two components of the overall policy, and let $\pi_k^{**}$ denote the best overall policy that can be represented using some fixed $\hat{\phi}_k$. The following theorem bounds the sub-optimality of $\pi_k^{**}$.

**Theorem 2.** *For an L-MDP $\mathcal{M}_k$, if there exist a $\varphi : \mathcal{S} \times \mathcal{S} \times \hat{\mathcal{E}} \to [0, 1]$ and a $\hat{\phi}_k : \hat{\mathcal{E}} \times \mathcal{A} \to [0, 1]$ such that*

$$\sup_{s \in \mathcal{S},\, a \in \mathcal{A}} \mathrm{KL}\Big( P(S_{t+1} \mid S_t = s, A_t = a) \,\Big\|\, P(S_{t+1} \mid S_t = s, A_t = \hat{A}) \Big) \le \delta_k^2 / 2, \tag{1}$$

*where $\hat{A} \sim \hat{\phi}_k(\cdot \mid \hat{E})$ and $\hat{E} \sim \varphi(\cdot \mid S_t, S_{t+1})$, then*

$$v^{\mu^*}(s_0) - v^{\pi_k^{**}}(s_0) \le \frac{\gamma\, (\rho \epsilon_k + \delta_k)}{(1 - \gamma)^2} R_{\max}.$$

*Proof.* See Appendix B.

By quantifying the impact $\hat{\phi}$ has on the sub-optimality of achievable performance, Theorem 2 provides the necessary constraints for estimating $\hat{\phi}$. At a high level, Equation (1) ensures that $\hat{\phi}$ can be used to generate an action corresponding to any $s$ to $s'$ transition. This allows $\beta$ to leverage $\hat{\phi}$ and choose the required action that induces the state transition needed for maximizing performance. Thereby, following (1), sub-optimality would be minimized if $\hat{\phi}$ and $\varphi$ are optimized to reduce the supremum of the KL divergence over all $s$ and $a$. In practice, however, the agent does not have access to all possible states; rather, it has access to a limited set of samples collected from interactions with the environment. Therefore, instead of the supremum, we propose minimizing the average over all $s$ and $a$ from a set of observed transitions,

$$L(\hat{\phi}, \varphi) := \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_k} P(s, a)\, \mathrm{KL}\big( P(s' \mid s, a) \,\big\|\, P(s' \mid s, \hat{a}) \big). \tag{2}$$

Equation (2) suggests that $L(\hat{\phi}, \varphi)$ would be minimized when $\hat{a}$ equals $a$, but using (2) directly in its current form is inefficient, as it requires computing the KL over all probable $s' \in \mathcal{S}$ for a given $s$ and $a$. To make it practical, we make use of the following property.

**Property 1.** *For some constant $C$, $-L(\hat{\phi}, \varphi)$ is lower bounded by*

$$\sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_k} \sum_{s' \in \mathcal{S}} P(s, a, s') \Big( \mathbb{E}\big[ \log \hat{\phi}(a \mid \hat{e}) \,\big|\, \hat{e} \sim \varphi(\cdot \mid s, s') \big] - \mathrm{KL}\big( \varphi(\hat{e} \mid s, s') \,\big\|\, P(\hat{e} \mid s, s') \big) \Big) + C. \tag{3}$$

*Proof.* See Appendix C.

As minimizing $L(\hat{\phi}, \varphi)$ is equivalent to maximizing $-L(\hat{\phi}, \varphi)$, we consider maximizing the lower bound obtained from Property 1. In this form, it is now practical to optimize (3) using only the observed $(s, a, s')$ samples. As this form is similar to the objective of a variational auto-encoder, the inner expectation can be efficiently optimized using the reparameterization trick (Kingma and Welling 2013). $P(\hat{e} \mid s, s')$ is the prior on $\hat{e}$, and we treat it as a hyper-parameter that allows the KL to be computed in closed form. Importantly, note that this optimization procedure only requires individual transitions, $(s, a, s')$, and is independent of the reward signal. Hence, at its core, it is a supervised learning procedure. This means that learning good parameters for $\hat{\phi}$ tends to require far fewer samples than optimizing $\beta$ (which is an RL problem).
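A minimal sketch of this adaptation objective, reusing the `ActionMapper` above and assuming a Gaussian $\varphi$ with a standard normal prior on $\hat{e}$ (so the KL term has a closed form); this mirrors the structure of the bound in Property 1 but is our illustration, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamics(nn.Module):
    """varphi(. | s, s'): a Gaussian over the inferred structure E_hat."""
    def __init__(self, state_dim, embed_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, embed_dim)
        self.log_std = nn.Linear(hidden, embed_dim)

    def forward(self, s, s_next):
        h = self.body(torch.cat([s, s_next], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5, 2)

def adaptation_loss(varphi, phi_hat, s, a, s_next, kl_weight=1e-2):
    """Negative of the lower bound in Property 1 (to be minimized):
    -E[log phi_hat(a | e_hat)] + KL(varphi(. | s, s') || N(0, I)),
    with e_hat drawn via the reparameterization trick."""
    mean, log_std = varphi(s, s_next)
    std = log_std.exp()
    e_hat = mean + std * torch.randn_like(std)            # reparameterized sample of e_hat
    logits = -torch.cdist(e_hat, phi_hat.embeddings)      # score all currently available actions
    recon = F.cross_entropy(logits, a)                    # -log phi_hat(a | e_hat), batch average
    kl = 0.5 * (std.pow(2) + mean.pow(2) - 1.0 - 2.0 * log_std).sum(dim=-1).mean()
    return recon + kl_weight * kl                         # kl_weight stands in for the prior choice
```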
This is beneficial for our approach because $\hat{\phi}$, the component of the policy where new parameters need to be added when new actions become available, can be updated efficiently. As both $\beta$ and $\varphi$ are invariant to action cardinality, they do not require new parameters when new actions become available. Additional implementation-level details are available in Appendix F.

When a new set of actions, $\mathcal{A}_{k+1}$, becomes available, the agent should leverage its existing knowledge and quickly adapt to the new action set. Therefore, during every change in $\mathcal{M}_k$, the current best components of the policy, $\beta^*_{k-1}$ and $\hat{\phi}^*_{k-1}$, from $\mathcal{M}_{k-1}$ are carried over, i.e., $\beta_k := \beta^*_{k-1}$ and $\hat{\phi}_k := \hat{\phi}^*_{k-1}$. For lifelong learning, the following property illustrates a way to organize the learning procedure so as to minimize the sub-optimality in each $\mathcal{M}_k$, for all $k$.

**Property 2** (Lifelong Adaptation and Improvement). *In an L-MDP, let $\Delta$ denote the difference in performance between $v^{\mu^*}$ and the best achievable using our policy parameterization; then the overall sub-optimality can be expressed as*

$$v^{\mu^*}(s_0) - v^{\beta_1 \hat{\phi}_1}_{\mathcal{M}_1}(s_0) = \sum_{k} \Big( \underbrace{v^{\beta_k \hat{\phi}^*_k}_{\mathcal{M}_k}(s_0) - v^{\beta_k \hat{\phi}_k}_{\mathcal{M}_k}(s_0)}_{\text{Adaptation}} + \underbrace{v^{\beta^*_k \hat{\phi}^*_k}_{\mathcal{M}_k}(s_0) - v^{\beta_k \hat{\phi}^*_k}_{\mathcal{M}_k}(s_0)}_{\text{Policy Improvement}} \Big) + \Delta,$$

*where $\mathcal{M}_k$ is used in the subscript to emphasize the respective MDP in $\mathcal{L}$.*

*Proof.* See Appendix D.

Property 2 illustrates a way to understand the impact of $\beta$ and $\hat{\phi}$ by splitting the learning process into an adaptation phase and a policy improvement phase. These two iterative phases are the crux of our algorithm for solving an L-MDP $\mathcal{L}$. Based on this principle, we call our algorithm LAICA: lifelong adaptation and improvement for changing actions. Due to space constraints, we only briefly discuss the LAICA algorithm here; a detailed description with pseudocode is presented in Appendix E.

Whenever new actions become available, adaptation is prone to cause a performance drop, as the agent has no information about when to use the new actions, and so its initial uses of the new actions may be at inappropriate times. Following Property 1, we update $\hat{\phi}$ so as to efficiently infer the underlying structure and minimize this drop. That is, for every $\mathcal{M}_k$, $\hat{\phi}_k$ is first adapted to $\hat{\phi}^*_k$ in the adaptation phase by adding more parameters for the new set of actions and then optimizing (3). After that, $\hat{\phi}^*_k$ is fixed and $\beta_k$ is improved towards $\beta^*_k$ in the policy improvement phase, by updating the parameters of $\beta_k$ using the policy gradient theorem (Sutton et al. 2000). These two procedures are performed sequentially whenever $\mathcal{M}_{k-1}$ transitions to $\mathcal{M}_k$, for all $k$, in an L-MDP $\mathcal{L}$. An illustration of the procedure is presented in Figure 2.

Figure 2: An illustration of a typical performance curve for a lifelong learning agent. Point (a) corresponds to the performance of the current policy in $\mathcal{M}_k$. Point (b) corresponds to the performance drop resulting from adding new actions. We call the phase between (a) and (b) the adaptation phase, which aims at minimizing this drop when adapting to the new set of actions. Point (c) corresponds to the improved performance in $\mathcal{M}_{k+1}$ obtained by optimizing the policy to leverage the new set of available actions. $\mu^*$ represents the best performance of the hypothetical policy that has access to the entire structure in the action space.

A step-by-step pseudocode for the LAICA algorithm is available as Algorithm 1 in Appendix E. The crux of the algorithm is the iterative adapt-and-improve procedure obtained from Property 2.
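Putting the pieces together, a high-level sketch of the adapt-and-improve loop might look as follows (the real procedure is Algorithm 1 in the paper's Appendix E; `collect_transitions`, `run_episode`, and `policy_gradient_step` are hypothetical helpers, and `env` is assumed to expose a growing action set as in the earlier sketch):

```python
import torch

def laica_style_loop(env, beta, phi_hat, varphi, num_episodes, adapt_steps=1000):
    """Illustrative adapt-and-improve loop (cf. Property 2), not the authors' pseudocode.
    beta and varphi keep their parameters across changes; only phi_hat grows."""
    for tau in range(num_episodes):
        if env.maybe_add_actions():                        # I_tau = 1: the action set changed
            num_new = len(env.actions) - phi_hat.embeddings.shape[0]
            phi_hat.add_actions(num_new)                   # add parameters for the new actions

            # Adaptation phase: supervised, reward-free update of phi_hat and varphi.
            opt = torch.optim.Adam([*phi_hat.parameters(), *varphi.parameters()], lr=1e-3)
            for _ in range(adapt_steps):
                s, a, s_next = collect_transitions(env, beta, phi_hat)   # hypothetical helper
                loss = adaptation_loss(varphi, phi_hat, s, a, s_next)
                opt.zero_grad()
                loss.backward()
                opt.step()

        # Policy improvement phase: update beta with a policy-gradient method,
        # keeping the adapted phi_hat fixed while selecting actions.
        episode = run_episode(env, beta, phi_hat)          # hypothetical helper
        policy_gradient_step(beta, episode)                # hypothetical helper
```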
## Empirical Analysis

In this section, we aim to empirically compare the following methods:

- **Baseline(1):** The policy is re-initialized and the agent learns from scratch after every change.
- **Baseline(2):** New parameters corresponding to the new actions are added/stacked to the existing policy (and previously learned parameters are carried forward as-is).
- **LAICA(1):** The proposed approach that leverages the structure in the action space. To act in the continuous space of the inferred structure, we use DPG (Silver et al. 2014) to optimize $\beta$.
- **LAICA(2):** A variant of LAICA which uses an actor-critic (Sutton and Barto 2018) to optimize $\beta$.

Figure 3: Lifelong learning experiments with a changing set of actions in the maze domain. The learning curves correspond to the running mean of the best performing setting for each of the algorithms. The shaded regions correspond to the standard error obtained using 10 trials. Vertical dotted bars indicate when the set of actions was changed.

To demonstrate the effectiveness of our proposed method(s) on lifelong learning problems, we consider a maze environment and two domains corresponding to real-world applications, all with a large set of changing actions. For each of these domains, the total set of actions was randomly split into five equal subsets. Initially, the agent only had the actions available in the first set, and after every change the next set of actions was made available additionally. In the following paragraphs we briefly outline the domains; full details are deferred to Appendix F.

**Maze Domain.** As a proof of concept, we constructed a continuous-state maze environment where the state comprises the coordinates of the agent's location and its objective is to reach a fixed goal state. The agent has a total of 256 actions corresponding to displacements in different directions of different magnitudes. This domain provides a simple yet challenging testbed that requires solving a long-horizon task using a large, changing action set, in the presence of a single goal reward.

**Case Study: Real-World Recommender Systems.** We consider the following two real-world applications of large-scale recommender systems that require decision making over multiple time steps and where the number of possible decisions varies over the lifetime of the system.

- A web-based video-tutorial platform that has a recommendation engine to suggest a series of tutorial videos. The aim is to meaningfully engage the users in a learning activity. In total, 1498 tutorials were considered for recommendation.
- A professional multimedia editing software, where sequences of tools inside the software need to be recommended. The aim is to increase user productivity and assist users in quickly achieving their end goal. In total, 1843 tools were considered for recommendation.

For both of these applications, an existing log of users' click-stream data was used to create an n-gram based MDP model of user behavior (Shani, Heckerman, and Brafman 2005). Sequences of user interaction were aggregated to obtain over 29 million clicks and 1.75 billion user clicks for the tutorial recommendation and the tool recommendation task, respectively. The MDP had a continuous state space, where each state consisted of the feature descriptors associated with each item (tutorial or tool) in the current n-gram.

The plots in Figures 3 and 4 present the evaluations on the domains considered. The advantage of LAICA over Baseline(1) can be attributed to its policy parameterization. The decision making component of the policy, $\beta$, being invariant to the action cardinality, can be readily leveraged after every change without having to be re-initialized. This demonstrates that efficiently re-using past knowledge can improve data efficiency over the approach that learns from scratch every time. Compared to Baseline(2), which also does not start from scratch and reuses the existing policy, we notice that the variants of the LAICA algorithm still perform favorably. As evident from the plots in Figures 3 and 4, while Baseline(2) does a good job of preserving the existing policy, it fails to efficiently capture the benefit of new actions.
While the policy parameters in both LAICA and Baseline(2) are improved using policy gradients, the superior performance of LAICA can be attributed to its adaptation procedure, which aims at efficiently inferring the underlying structure in the space of actions. Overall, LAICA(2) performs almost twice as well as both of the baselines on all of the tasks considered. In the maze domain, even the best setting for Baseline(2) performed inconsistently: due to the sparse-reward nature of the task, which only provides a large positive reward on reaching the goal, even the best setting for Baseline(2) failed on certain trials, resulting in high variance.

Figure 4: Lifelong learning experiments with a changing set of actions in the recommender system domains. The learning curves correspond to the running mean of the best performing setting for each of the algorithms. The shaded regions correspond to the standard error obtained using 10 trials. Vertical dotted bars indicate when the set of actions was changed.

Note that even before the first addition of a new set of actions, the proposed method performs better than the baselines. This can be attributed to the fact that the proposed method efficiently leverages the underlying structure in the action set and thus learns faster. Similar observations have been made previously (Dulac-Arnold et al. 2015; He et al. 2015; Bajpai, Garg, and others 2018).

In terms of computational cost, the proposed method updates the inverse dynamics model and the underlying action structure only when there is a change in the action set (Algorithm 1). Therefore, compared to the baselines, no extra computational cost is required for training at each time step. However, the added computational cost does impact the overall learning process and is proportional to the number of times new actions are introduced.

## Discussion and Conclusion

In this work we established first steps towards developing the lifelong MDP setup for dealing with action sets that change over time. Our proposed approach then leveraged the structure in the action space so that an existing policy can be efficiently adapted to the new set of available actions. Superior performance on both synthetic and large-scale real-world environments demonstrates the benefits of the proposed LAICA algorithm. To the best of our knowledge, this is the first work to take a step towards addressing the problem of lifelong learning with a changing action set. We hope that this brings more attention to such understudied problems in lifelong learning.

There are several important challenges open for future investigation. In many real-world applications, often due to external factors, some actions are removed over time as well: for example, a medicine becomes outdated, a product is banned, etc. While our applications were devoid of this aspect, the proposed algorithm makes use of a policy parameterization that is invariant to the cardinality of the action set, and thus can support both addition and removal. Our proposed policy decomposition method can still be useful for selecting an available action whose impact on the state transition is most similar to that of the removed action. There can also be applications where new actions added over time have no relation to the previously observed actions: for example, completely new product categories, tutorial videos on new topics, etc. In such cases, it is unclear how to leverage past information efficiently, and we do not expect our proposed method to work well in such settings.
## Acknowledgement

The research was supported by and partially conducted at Adobe Research. We are also immensely grateful to the three anonymous reviewers who shared their insights and feedback.

## References

Bajpai, A. N.; Garg, S.; et al. 2018. Transfer of deep reactive policies for MDP planning. In Advances in Neural Information Processing Systems, 10965–10975.

Boutilier, C.; Cohen, A.; Daniely, A.; Hassidim, A.; Mansour, Y.; Meshi, O.; Mladenov, M.; and Schuurmans, D. 2018. Planning and learning with stochastic action sets. In IJCAI.

Chandak, Y.; Theocharous, G.; Kostas, J.; Jordan, S.; and Thomas, P. S. 2019. Learning action representations for reinforcement learning. In International Conference on Machine Learning.

Chen, Z., and Liu, B. 2016. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 10(3):1–145.

Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.

Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; and Coppin, B. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.

Ferreira, L. A.; Bianchi, R. A.; Santos, P. E.; and de Mantaras, R. L. 2017. Answer set programming for non-stationary Markov decision processes. Applied Intelligence 47(4):993–1007.

Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org.

French, R. M. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3(4):128–135.

Gabel, T., and Riedmiller, M. 2008. Reinforcement learning for DEC-MDPs with changing action sets and partially ordered dependencies. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, Volume 3, 1333–1336. International Foundation for Autonomous Agents and Multiagent Systems.

Gajane, P.; Ortner, R.; and Auer, P. 2018. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066.

Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, 5302–5311.

He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2015. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13):3521–3526.

Lopez-Paz, D., et al. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 6467–6476.

Mandel, T.; Liu, Y.-E.; Brunskill, E.; and Popović, Z. 2017. Where to add actions in human-in-the-loop reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence.

Nachum, O.; Gu, S.; Lee, H.; and Levine, S. 2018. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257.

Neu, G. 2013. Online learning in non-stationary Markov decision processes. CoRR.
Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Ruvolo, P., and Eaton, E. 2013. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning, 507–515.

Shani, G.; Heckerman, D.; and Brafman, R. I. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6(Sep):1265–1295.

Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.

Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI Spring Symposium Series.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.

Tennenholtz, G., and Mannor, S. 2019. The natural language of actions. In International Conference on Machine Learning.

Thrun, S. 1998. Lifelong learning algorithms. In Learning to Learn. Springer. 181–209.

Wang, J.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J.; Munos, R.; Blundell, C.; Kumaran, D.; and Botvinick, M. 2017. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.

Xu, Z.; van Hasselt, H. P.; and Silver, D. 2018. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems.

Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 3987–3995. JMLR.org.