# Temporally Abstract Partial Models

Khimya Khetarpal 1,2, Zafarali Ahmed 3, Gheorghe Comanici 3, Doina Precup 1,2,3
1 McGill University, 2 Mila, 3 DeepMind

Abstract

Humans and animals have the ability to reason and make predictions about different courses of action at many time scales. In reinforcement learning, option models (Sutton, Precup & Singh, 1999; Precup, 2000) provide the framework for this kind of temporally abstract prediction and reasoning. Natural intelligent agents are also able to focus their attention on courses of action that are relevant or feasible in a given situation, sometimes termed affordable actions. In this paper, we define a notion of affordances for options, and develop temporally abstract partial option models that take into account the fact that an option might be affordable only in certain situations. We analyze the trade-offs between estimation and approximation error in planning and learning when using such models, and identify some interesting special cases. Additionally, we empirically demonstrate the ability to learn both affordances and partial option models online, resulting in improved sample efficiency and planning time in the Taxi domain.

1 Introduction

Intelligent agents flexibly reason about the applicability and effects of their actions over different time scales, which in turn allows them to consider different courses of action. Yet modeling the entire complexity of a realistic environment is quite difficult and requires a lot of data (Kakade et al., 2003). Animals and people exhibit a powerful ability to control the modelling process by understanding which actions deserve any consideration at all in a situation. Anticipating only certain aspects of the effects of actions, over different time horizons, can make models more predictable and easier to learn. In this paper we develop the theoretical underpinnings of how such an ability could be defined and studied in sequential decision making. We work in the context of model-based reinforcement learning (MBRL) (Sutton and Barto, 2018) and temporal abstraction in the framework of options (Sutton et al., 1999).

Theories of embodied cognition and perception suggest that humans are able to represent world knowledge in the form of internal models across different time scales (Pezzulo and Cisek, 2016). Option models provide a framework for RL agents to exhibit the same capability. Options define a way of behaving, including a set of states in which an option can start, an internal policy that is used to make decisions while the option is executing, and a stochastic, state-dependent termination condition. Models of options predict the (discounted) reward that an option would receive over time and the (discounted) probability distribution over the states attained at termination (Sutton et al., 1999). Consequently, option models enable the extension of dynamic programming and many other RL planning methods to achieve temporal abstraction, i.e. to seamlessly consider different time scales of decision-making. Much of the work on learning and planning with options considers the case where they apply everywhere (Bacon et al., 2017; Harb et al., 2017; Harutyunyan et al., 2019b,a), with some notable recent exceptions which generalize the notion of initiation sets in the context of function approximation (Khetarpal et al., 2020b).
Having options that are partially defined is very important in order to control the complexity of the planning and exploration process. The notion of partially defined option models, which make predictions only from a subset of states, is the focus of our paper.

In natural intelligence, the ability to make predictions across different scales is linked with the ability to understand action possibilities (i.e. affordances) (Gibson, 1977), which arise at the interface of an agent and an environment and are a key component of successful adaptive control (Fikes et al., 1972; Korf, 1983; Drescher, 1991; Cisek and Kalaska, 2010). Recent work (Khetarpal et al., 2020a) has described a way to implement affordances in RL agents, by formalizing a notion of intent over state space, and then defining an affordance as the set of state-action pairs that achieve that intent to a certain degree. One can then plan with partial, approximate models that map affordances to intents, incurring a quantifiable amount of error at the benefit of faster learning and deliberation. In this paper, we generalize the notion of intents and affordances to option models. As we will see in Sec. 3, this is non-trivial and requires carefully inspecting the definition of option models. The resulting temporally abstract models are partial, in the sense that they apply only to certain states and options.

Key Contributions. We present a framework defining temporally extended intents, affordances and abstract partial option models (Sec. 3). We derive theoretical results quantifying the loss incurred when using such models for planning, exposing trade-offs between single-step models and full option models (Sec. 4). Our theoretical guarantees provide insights and decouple the role of affordances from temporal abstraction. Empirically, we demonstrate end-to-end learning of affordances and partial option models, showcasing significant improvement in final performance and sample efficiency when used for planning in the Taxi domain (Sec. 5).

2 Background

In RL, a decision-making agent interacts with an environment through a sequence of actions, in order to learn a way of behaving (aka policy) that maximizes its value, i.e. long-term expected return (Sutton and Barto, 2018). This process is typically formalized as a Markov Decision Process (MDP). A finite MDP is a tuple $M = \langle S, A, r, P, \gamma \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, $r : S \times A \to [0, R_{\max}]$ is the reward function, $P : S \times A \to \mathrm{Dist}(S)$ is the transition dynamics, mapping state-action pairs to a distribution over next states, and $\gamma \in [0, 1)$ is the discount factor. At each time step $t$, the agent observes a state $s_t \in S$ and takes an action $a_t \in A$ drawn from its policy $\pi : S \to \mathrm{Dist}(A)$ and, with probability $P(s_{t+1}|s_t, a_t)$, enters the next state $s_{t+1} \in S$ while receiving a numerical reward $r(s_t, a_t)$. The value function of policy $\pi$ in state $s$ is the expectation of the long-term return obtained by executing $\pi$ from $s$, defined as: $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \mid S_0 = s, A_t \sim \pi(\cdot|S_t), S_{t+1} \sim P(\cdot|S_t, A_t)\right]$. The goal of the agent is to find an optimal policy, $\pi^* = \arg\max_{\pi} V^{\pi}$. If the model of the MDP, consisting of $r$ and $P$, is given, the value iteration algorithm can be used to obtain the optimal value function, $V^*$, by computing the fixed point of the Bellman equations (Bellman, 1957): $V^*(s) = \max_a \left[ r(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s') \right], \ \forall s$. The optimal policy $\pi^*$ can then be obtained by acting greedily with respect to $V^*$.
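As a concrete illustration of this planning setup, the following is a minimal tabular value iteration sketch. The array-based interface (rewards as an (|S|, |A|) array, transitions as an (|S|, |A|, |S|) array) and the tolerance-based stopping rule are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def value_iteration(r, P, gamma, tol=1e-8):
    """Compute V* and a greedy policy for a tabular MDP.

    r: array of shape (S, A), expected immediate rewards r(s, a).
    P: array of shape (S, A, S), transition probabilities P(s' | s, a).
    gamma: discount factor in [0, 1).
    """
    num_states, num_actions = r.shape
    V = np.zeros(num_states)
    while True:
        # Q(s, a) = r(s, a) + gamma * sum_{s'} P(s' | s, a) V(s')
        Q = r + gamma * (P @ V)            # shape (S, A)
        V_new = Q.max(axis=1)              # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values, greedy policy
        V = V_new
```

Acting greedily with respect to the returned values recovers the optimal policy, as stated above.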
Semi-Markov Decision Process (SMDP). An SMDP (Puterman, 1994) is a generalization of MDPs in which the amount of time between two decision points is a random variable. The transition model of the environment is therefore a joint distribution over the next decision state and the time, conditioned on the current state and action. SMDPs obey Bellman equations similar to those for MDPs.

Options. Options (Sutton et al., 1999) provide a framework for temporal abstraction which builds on SMDPs, but also leverages the fact that the agent acts in an underlying MDP. A Markovian option $o$ is composed of an intra-option policy $\pi_o$, a termination condition $\beta_o : S \to [0, 1]$, where $\beta_o(s)$ is the probability of terminating the option upon entering $s$, and an initiation set $I_o \subseteq S$. Let $\Omega$ be the set of all options. In this document, we will use $O \subseteq \Omega$ to denote the set of options available to the agent and $O(s) = \{o \mid s \in I_o\}$ to denote the set of options available at state $s$. In call-and-return option execution, when an agent is at a decision point, it examines its current state $s$, chooses $o \in O(s)$ according to a policy over options $\pi_{\Omega}(s)$, then follows the internal policy $\pi_o$ until the option terminates according to $\beta_o$. Termination yields a new decision point, where this process is repeated.

Option Models. The model of an option $o$ predicts its reward and transition dynamics following a state $s \in I_o$, as follows: $r(s, o) \doteq \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{k-1} R_{t+k} \mid S_t = s, O_t = o]$, and $p(s' \mid s, o) \doteq \sum_{k=1}^{\infty} \gamma^k \Pr(S_{t+k} = s', \text{$o$ terminates at } t+k \mid S_t = s, O_t = o)$.

[...]

A state-option pair is considered affordable when the output of the affordance classifier exceeds a threshold value, $k$. When $k = 0$, all states and options are affordable. The affordance classifier is learned at the same time as the option model, $\hat{M}$, using the standard cross-entropy objective $\sum_{I \in \mathcal{I}} c(s, o, s', I) \log A(s, o, s', I)$, where $c(s, o, s', I)$ is the intent completion function indicating whether intent $I$ was completed during the transition. The threshold $k$ controls the size of the affordance set (Fig. 5(a)), with larger values of $k$ resulting in smaller affordance sets. The learned affordance set for Pickup+Drop@Goal contains 2,000 state-option pairs, which is smaller than the one we heuristically defined (4,000 state-option pairs). Smaller affordance sets result in improved sample efficiency (Fig. 5(b)). We highlight that this is not necessarily obvious, since the learned affordance sets could remove potentially useful state-option pairs, and $k$ would be used to control how restrictive the sets are. These results show that affordances can be learned online for a defined set of intents and result in good performance. In particular, there are sample efficiency gains from using more restricted affordance sets. Our results here demonstrate empirically that learning a partial option model requires far fewer samples than learning a full model. We also corroborate this with theoretical guarantees on the sample and computational complexity of obtaining an $\varepsilon$-estimation of the optimal option value function, given only access to a generative model (see Appendix Sec. C).

Figure 5: The impact of learning the affordance set for Pickup+Drop@Goal on (a) the size of the affordance set and (b) success in the downstream task. There is a one-to-one correspondence between the threshold $k$, the affordance set size, and the success rate on the taxi task. The learned affordance set for Pickup+Drop@Goal is smaller than the heuristic used in Fig. 3(c).
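To make the thresholding and planning procedure above concrete, the following is a minimal tabular sketch of turning a learned affordance classifier into an affordance set and planning only over the affordable state-option pairs. The function names, the array-based interface, and the assumption that every state retains at least one affordable option are ours, not the paper's implementation.

```python
import numpy as np

def affordance_set(A_hat, k):
    """Threshold a learned affordance classifier.

    A_hat: array (S, O) of estimated probabilities that option o, started in
           state s, completes one of the intents of interest.
    k: threshold; k = 0 keeps every state-option pair, and larger values of k
       yield smaller (more restrictive) affordance sets.
    Returns a boolean mask over (s, o) pairs.
    """
    return A_hat > k

def smdp_value_iteration(r_o, p_o, afford, tol=1e-8):
    """Plan with partial option models, backing up only affordable options.

    r_o: array (S, O), option reward model r(s, o).
    p_o: array (S, O, S), discounted option transition model p(s' | s, o)
         (the gamma^k discounting over the option's duration is folded into
         p_o, as in Sutton et al., 1999).
    afford: boolean mask (S, O); assumes every state has at least one
            affordable option.
    """
    num_states, _ = r_o.shape
    V = np.zeros(num_states)
    while True:
        Q = r_o + p_o @ V                 # SMDP Bellman backup, shape (S, O)
        Q = np.where(afford, Q, -np.inf)  # exclude non-affordable pairs
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

With k = 0 this reduces to planning with full option models over O(s); larger thresholds shrink the affordance set, so each backup maximizes over fewer options, which is the intuition behind the planning-time savings discussed above.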
6 Related Work

Affordances are viewed as action opportunities (Gibson, 1977; Chemero, 2003) emerging out of the agent-environment interaction (Heft, 1989), and have typically been studied in AI as possibilities associated with an object (Slocum et al., 2000; Fitzpatrick et al., 2003; Lopes et al., 2007; Montesano et al., 2008; Cruz et al., 2016, 2018; Fulda et al., 2017; Song et al., 2015; Abel et al., 2014). Affordances have also been formalized in RL without the assumption of objects (Khetarpal et al., 2020a). Our work extends this formalization to the general case of temporal abstraction (Sutton et al., 1999).

The process model of behavior and cognition (Pezzulo and Cisek, 2016) expresses the space of affordances at multiple levels of abstraction. During interactive behavior, action representations at different levels of abstraction can indeed be mapped to findings about the way in which the human brain adaptively selects among predictions of outcomes at different time scales (Cisek and Kalaska, 2010; Pezzulo and Cisek, 2016). In RL, the generalization of one-step action models to option models (Sutton et al., 1999) enables an agent to predict and reason at multiple time scales. Precup et al. (1998) established dynamic programming results for option models, which enjoy theoretical guarantees similar to those of primitive action models. Abel et al. (2019) proposed expected-length models of options. Our theoretical results can also be extended to expected-length option models.

Building agents that can represent and use predictive knowledge requires efficient solutions to cope with the combinatorial explosion of possibilities, especially in large environments. Partial models (Talvitie and Singh, 2009) provide an elegant solution to this problem, as they only model part of the observation. Existing methods focus on predictions for only some of the observations (Oh et al., 2017; Amos et al., 2018; Guo et al., 2018; Gregor et al., 2019; Zhao et al., 2021), but they still model the effects of all actions and focus on single-step dynamics (Watters et al., 2019). Recent work by Xu et al. (2020) proposed a deep RL approach to learn partial models with goals akin to intents, which is complementary to our work.

7 Conclusions and Limitations

We presented notions of intents and affordances that can be used together with options. They allow us to define temporally abstract partial models, which extend option models to be conditioned on affordances. Our theoretical analysis suggests that modelling temporally extended dynamics for only relevant parts of the environment-agent interface provides two-fold benefits: 1) faster planning across different timescales (Sec. 4), and 2) improved sample efficiency (Appendix Sec. C). These benefits can come at the cost of some increase in approximation bias, but the trade-off can still be favourable. For example, in the low-data regime, intermediate-size affordances (much smaller than the entire state-option space) can substantially improve the speed of planning. Picking intents judiciously can also yield sample complexity gains, provided the approximation error due to the intent is manageable. Our empirical illustration shows that our approach can produce significant benefits.

Limitations & Future Work. Our analysis assumes that the intents and options are fixed a priori.
To learn intents, we envisage an iterative algorithm which alternates between learning intents and affordances, such that intents can be refined over time and mis-specifications can be self-corrected (Talvitie, 2017). Our analysis is complementary to any method for providing or discovering intents. Another important future direction is to build partial option models and leverage their predictions in large-scale problems (Vinyals et al., 2019). It would also be useful to relate our work to cognitive science models of intentional options, which can reason about the space of future affordances (Pezzulo and Cisek, 2016). Aligned with future affordances, a promising research avenue is to study the emergence of new affordances at the boundary of the agent-environment interaction in the presence of non-stationarity (Chandak et al., 2020).

Acknowledgments and Disclosure of Funding

The authors would like to thank Feryal Behbahani and Dave Abel for very detailed feedback, Martin Klissarov and Emmanuel Bengio for valuable comments on a draft of this paper, and Joelle Pineau for feedback on ideas presented in this work. A special thank you to Ahmed Touati for discussion and the detailed notes on (Azar et al., 2012) presented in the RL theory reading group at Mila.

References

Abel, D., Barth-Maron, G., MacGlashan, J., and Tellex, S. (2014). Toward affordance-aware planning. In First Workshop on Affordances: Affordances in Vision for Cognitive Robotics.

Abel, D., Hershkowitz, D. E., Barth-Maron, G., Brawner, S., O'Farrell, K., MacGlashan, J., and Tellex, S. (2015). Goal-based action priors. In Twenty-Fifth International Conference on Automated Planning and Scheduling.

Abel, D., Winder, J., desJardins, M., and Littman, M. L. (2019). The expected-length model of options. In International Joint Conference on Artificial Intelligence.

Amos, B., Dinh, L., Cabi, S., Rothörl, T., Colmenarejo, S. G., Muldal, A., Erez, T., Tassa, Y., de Freitas, N., and Denil, M. (2018). Learning awareness models. arXiv preprint arXiv:1804.06318.

Azar, M. G., Munos, R., and Kappen, B. (2012). On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461.

Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1726–1734.

Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

Chandak, Y., Theocharous, G., Nota, C., and Thomas, P. (2020). Lifelong learning with a changing action set. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3373–3380.

Chemero, A. (2003). An outline of a theory of affordances. Ecological Psychology, 15(2):181–195.

Cisek, P. and Kalaska, J. F. (2010). Neural mechanisms for interacting with a world full of action choices. Annual Review of Neuroscience, 33:269–298.

Cruz, F., Magg, S., Weber, C., and Wermter, S. (2016). Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems, 8(4):271–284.

Cruz, F., Parisi, G. I., and Wermter, S. (2018). Multi-modal feedback for affordance-driven interactive reinforcement learning. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.

Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303.

Diuk, C., Cohen, A., and Littman, M. L. (2008). An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 240–247. ACM.
Drescher, G. L. (1991). Made-up Minds: A Constructivist Approach to Artificial Intelligence. MIT Press, Cambridge, MA, USA.

Fikes, R. E., Hart, P. E., and Nilsson, N. J. (1972). Learning and executing generalized robot plans. Artificial Intelligence.

Fitzpatrick, P., Metta, G., Natale, L., Rao, S., and Sandini, G. (2003). Learning about objects through action: initial steps towards artificial cognition. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), volume 3, pages 3140–3145. IEEE.

Fulda, N., Ricks, D., Murdoch, B., and Wingate, D. (2017). What can you do with a rock? Affordance extraction via word embeddings. arXiv preprint arXiv:1703.03429.

Gibson, J. J. (1977). The theory of affordances. Hilldale, USA, 1(2).

Gregor, K., Rezende, D. J., Besse, F., Wu, Y., Merzic, H., and van den Oord, A. (2019). Shaping belief states with generative environment models for RL. In Advances in Neural Information Processing Systems, pages 13475–13487.

Guo, Z. D., Azar, M. G., Piot, B., Pires, B. A., and Munos, R. (2018). Neural predictive belief representations. arXiv preprint arXiv:1811.06407.

Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. (2017). When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571.

Harutyunyan, A., Dabney, W., Borsa, D., Heess, N., Munos, R., and Precup, D. (2019a). The termination critic. arXiv preprint arXiv:1902.09996.

Harutyunyan, A., Vrancx, P., Hamel, P., Nowé, A., and Precup, D. (2019b). Per-decision option discounting. In International Conference on Machine Learning, pages 2644–2652. PMLR.

Heft, H. (1989). Affordances and the body: An intentional analysis of Gibson's ecological approach to visual perception. Journal for the Theory of Social Behaviour, 19(1):1–30.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. (2015). The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems.

Kakade, S. M. (2003). On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England.

Kearns, M. J. and Singh, S. P. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pages 996–1002.

Khetarpal, K., Ahmed, Z., Comanici, G., Abel, D., and Precup, D. (2020a). What can I do here? A theory of affordances in reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5243–5253.

Khetarpal, K., Klissarov, M., Chevalier-Boisvert, M., Bacon, P.-L., and Precup, D. (2020b). Options of interest: Temporal abstraction with interest functions. Proceedings of the AAAI Conference on Artificial Intelligence, pages 4444–4451.

Korf, R. E. (1983). Learning to Solve Problems by Searching for Macro-operators. PhD thesis, Pittsburgh, PA, USA.

Lawlor, J. (2020). jakelawlor/pnwcolors: A Pacific Northwest inspired R color palette package.

Lopes, M., Melo, F. S., and Montesano, L. (2007). Affordance-based imitation learning in robots. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1015–1021. IEEE.
Montesano, L., Lopes, M., Bernardino, A., and Santos-Victor, J. (2008). Learning object affordances: From sensory-motor coordination to imitation. IEEE Transactions on Robotics, 24(1):15–26.

Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128.

Pezzulo, G. and Cisek, P. (2016). Navigating the affordance landscape: Feedback control as a process model of behavior and cognition. Trends in Cognitive Sciences, 20(6):414–424.

Precup, D., Sutton, R. S., and Singh, S. (1998). Theoretical results on reinforcement learning with temporally abstract options. In European Conference on Machine Learning, pages 382–393. Springer.

Puterman, M. (1994). Markov Decision Processes. John Wiley & Sons, New Jersey.

Slocum, A. C., Downey, D. C., and Beer, R. D. (2000). Further experiments in the evolution of minimally cognitive behavior: From perceiving affordances to selective attention. In From Animals to Animats 6: Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior, pages 430–439.

Song, H. O., Fritz, M., Goehring, D., and Darrell, T. (2015). Learning to detect visual grasp affordance. IEEE Transactions on Automation Science and Engineering, 13(2):798–809.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.

Talvitie, E. (2017). Self-correcting models for model-based reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Talvitie, E. and Singh, S. P. (2009). Simple local models for complex dynamical systems. In Advances in Neural Information Processing Systems, pages 1617–1624.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354.

Watters, N., Matthey, L., Bosnjak, M., Burgess, C. P., and Lerchner, A. (2019). COBRA: Data-efficient model-based RL through unsupervised object discovery and curiosity-driven exploration. arXiv preprint arXiv:1905.09275.

Xu, D., Mandlekar, A., Martín-Martín, R., Zhu, Y., Savarese, S., and Fei-Fei, L. (2020). Deep affordance foresight: Planning through what can be done in the future.

Yang, F., Barth-Maron, G., Stańczyk, P., Hoffman, M., Liu, S., Kroiss, M., Pope, A., and Rrustemi, A. (2021). Launchpad: A programming model for distributed machine learning research. arXiv preprint arXiv:2106.04516.

Zhao, M., Liu, Z., Luan, S., Zhang, S., Precup, D., and Bengio, Y. (2021). A consciousness-inspired planning agent for model-based reinforcement learning. In Conference on Neural Information Processing Systems. https://arxiv.org/abs/2106.02097.