# Efficient World Models with Context-Aware Tokenization

Vincent Micheli\*¹, Eloi Alonso\*¹, François Fleuret¹

\*Equal contribution. ¹University of Geneva, Switzerland. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024.

## Abstract

Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose Δ-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, Δ-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.

## 1. Introduction

Deep Reinforcement Learning (RL) methods have recently delivered impressive results (Ye et al., 2021; Hafner et al., 2023; Schwarzer et al., 2023) in traditional benchmarks (Bellemare et al., 2013; Tassa et al., 2018). In light of the ever more complex domains tackled by the latest generations of generative models (Rombach et al., 2022; Achiam et al., 2023), the prospect of training agents in more ambitious environments (Kanervisto et al., 2022) may hold significant appeal. However, that leap forward poses a serious challenge: deep RL architectures have been comparatively smaller and less sample-efficient than their (self-)supervised counterparts. In contrast, more intricate environments necessitate models with greater representational power and have higher data requirements.

Model-based RL (MBRL) (Sutton & Barto, 2018) is hypothesized to be the key for scaling up deep RL agents (LeCun, 2022). Indeed, world models (Ha & Schmidhuber, 2018) offer a diverse range of capabilities: lookahead search (Schrittwieser et al., 2020; Ye et al., 2021), learning in imagination (Sutton, 1991; Hafner et al., 2023), representation learning (Schwarzer et al., 2021; D'Oro et al., 2023), and uncertainty estimation (Pathak et al., 2017; Sekar et al., 2020). In essence, MBRL shifts the focus from the RL problem to a generative modelling problem, where the development of an accurate world model significantly simplifies policy training. In particular, policies learnt in the imagination of world models are freed from sample-efficiency constraints, a common limitation of RL agents that is magnified in complex environments with slow rollouts.

Recently, the IRIS agent (Micheli et al., 2023) achieved strong results in the Atari 100k benchmark (Bellemare et al., 2013; Kaiser et al., 2020). IRIS introduced a world model composed of a discrete autoencoder and an autoregressive transformer, casting dynamics learning as a sequence modelling problem where the transformer composes over time a vocabulary of image tokens built by the autoencoder.
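To make this sequence-modelling view concrete, here is a minimal, hypothetical PyTorch sketch of an IRIS-style world model: a causal transformer consumes the K_I discrete tokens of each frame (produced by a separately trained discrete autoencoder, omitted here) interleaved with action tokens, and outputs logits over the next tokens. All names and sizes below are illustrative assumptions, not the hyperparameters of IRIS or Δ-IRIS.

```python
# Sketch (not the authors' code): autoregressive dynamics model over per-frame
# discrete tokens and actions, as in IRIS-style world models.
import torch
import torch.nn as nn

K_I = 16          # discrete tokens per frame (assumed)
VOCAB = 512       # autoencoder codebook size (assumed)
N_ACTIONS = 17    # e.g. Crafter's discrete action space
D = 256           # transformer width (assumed)

class TinyWorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.frame_tok_emb = nn.Embedding(VOCAB, D)
        self.action_emb = nn.Embedding(N_ACTIONS, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)  # logits over frame tokens

    def forward(self, frame_tokens, actions):
        # frame_tokens: (B, T, K_I) discrete codes, actions: (B, T) discrete actions
        B, T, _ = frame_tokens.shape
        tok = self.frame_tok_emb(frame_tokens)              # (B, T, K_I, D)
        act = self.action_emb(actions).unsqueeze(2)         # (B, T, 1, D)
        seq = torch.cat([tok, act], dim=2).flatten(1, 2)    # (B, T*(K_I+1), D)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.transformer(seq, mask=mask)                # causal attention
        return self.head(h)                                 # per-position token logits

logits = TinyWorldModel()(torch.randint(0, VOCAB, (1, 4, K_I)),
                          torch.randint(0, N_ACTIONS, (1, 4)))
```

Training would supervise each position with the following token (shifted targets). The key cost is visible in the shapes: the sequence length grows as T·(K_I+1), so a large K_I makes long-horizon imagination expensive.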
This approach opened up avenues for future model-based methods to capitalize on advances in generative modelling (Villegas et al., 2022; Achiam et al., 2023), and has already been adopted beyond its original domain (comma.ai, 2023; Hu et al., 2023). However, in its current form, scaling IRIS to more complex environments is computationally prohibitive. Indeed, such an endeavor requires a large number of tokens to encode visually challenging frames. Besides, sophisticated dynamics may require storing numerous time steps in memory to reason about the past, ultimately making the imagination procedure excessively slow. Hence, under these constraints, maintaining a favorable imagined-to-collected data ratio is practically infeasible.

In the present work, we introduce Δ-IRIS, a new agent capable of scaling to visually complex environments with longer time horizons. Δ-IRIS encodes frames by attending to the ongoing trajectory of observations and actions, effectively describing stochastic deltas between time steps. This enriched conditioning scheme drastically reduces the number of tokens needed to encode frames, offloads the deterministic aspects of world modelling to the autoencoder, and lets the autoregressive transformer focus on stochastic dynamics.

Figure 1. Discrete autoencoder of IRIS (Micheli et al., 2023) (left) and Δ-IRIS (right). IRIS encodes and decodes frames independently, meaning that $z_t$ has to carry all the information necessary to reconstruct $x_t$. On the other hand, the Δ-IRIS encoder and decoder are conditioned on past frames and actions, so $z_t$ only has to capture what has changed and cannot be inferred from actions, i.e. the stochastic delta. This conditioning scheme enables us to drastically reduce the number of tokens required to encode a frame with minimal loss ($K \ll K_I$), which is critical to speed up the autoregressive transformer that predicts future tokens.

Nonetheless, substituting the sequence of absolute image tokens with a sequence of Δ-tokens makes the task of the autoregressive model more arduous. In order to predict the next transition, it may only reason over previous Δ-tokens, and thus faces the challenge of integrating over multiple time steps as a way to form a representation of the current state of the world. To resolve this issue, we modify the sequence of the autoregressive model by interleaving continuous I-tokens, which summarize successive world states with frame embeddings, and discrete Δ-tokens.

In the Crafter benchmark (Hafner, 2022), Δ-IRIS exhibits favorable scaling properties: the agent solves 17 out of 22 tasks after 10M frames of data collection, surpasses DreamerV3 (Hafner et al., 2023) at multiple frame budgets, and trains 10 times faster than IRIS. In addition, we include results in the sample-efficient setting with Atari games. Through experiments, we provide evidence that Δ-IRIS learns to disentangle the deterministic and stochastic aspects of world modelling. Moreover, we conduct ablations to validate the new conditioning schemes for the autoencoder and transformer models.
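The following is a hypothetical sketch, in the same spirit as the previous one, of how the interleaved sequence for the Δ-IRIS-style dynamics transformer could be assembled: each time step contributes one continuous token (an embedding of the current frame and action, standing in for the I-token summary) followed by K discrete Δ-tokens. The encoder networks, names, and sizes are assumptions for illustration, not the released implementation.

```python
# Sketch: building an interleaved sequence of continuous state-summary tokens
# and discrete delta tokens, with K much smaller than a full per-frame encoding.
import torch
import torch.nn as nn

K, VOCAB, N_ACTIONS, D = 4, 512, 17, 256   # assumed sizes

frame_encoder = nn.Sequential(              # continuous summary of the current frame
    nn.Conv2d(3, 32, 8, stride=8), nn.ReLU(), nn.Flatten(), nn.LazyLinear(D)
)
action_embed = nn.Embedding(N_ACTIONS, D)
delta_embed = nn.Embedding(VOCAB, D)

def build_sequence(frames, actions, delta_tokens):
    """frames: (B, T, 3, 64, 64), actions: (B, T), delta_tokens: (B, T, K) -> (B, T*(1+K), D)."""
    B, T = actions.shape
    cont = frame_encoder(frames.flatten(0, 1)).view(B, T, 1, D) \
           + action_embed(actions).unsqueeze(2)              # continuous token per step
    disc = delta_embed(delta_tokens)                         # (B, T, K, D) delta tokens
    return torch.cat([cont, disc], dim=2).flatten(1, 2)      # interleave per time step

seq = build_sequence(torch.randn(2, 3, 3, 64, 64),
                     torch.randint(0, N_ACTIONS, (2, 3)),
                     torch.randint(0, VOCAB, (2, 3, K)))
```

With K much smaller than K_I, the sequence grows as T·(1+K) rather than T·(K_I+1), and the continuous summary token lets the transformer read off the current world state directly instead of integrating it from all previous Δ-tokens.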
## 2. Method

We consider a Partially Observable Markov Decision Process (POMDP) (Sutton & Barto, 2018). The transition, reward, and episode termination dynamics are captured by the conditional distributions $p(x_{t+1} \mid x_{\le t}, a_{\le t})$ and $p(r_t, d_t \mid x_{\le t}, a_{\le t})$, where $x_t \in X = \mathbb{R}^{3 \times h \times w}$ is an image observation, $a_t \in A = \{1, \dots, A\}$ a discrete action, $r_t \in \mathbb{R}$ a scalar reward, and $d_t \in \{0, 1\}$ indicates episode termination. The reinforcement learning objective is to find a policy $\pi(a_t \mid x_{\le t}, a_{<t})$ that maximizes the expected sum of discounted rewards $\mathbb{E}_\pi \left[ \sum_{t \ge 0} \gamma^t r_t \right]$.
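As a toy illustration of this objective (not part of the paper's setup), the snippet below rolls out a random policy in a gym-style environment and accumulates the discounted return $\sum_t \gamma^t r_t$; the environment and discount factor are placeholders.

```python
# Toy rollout computing the discounted return a policy is trained to maximize.
import gymnasium as gym

env = gym.make("CartPole-v1")   # stand-in for an image-based POMDP
gamma, ret, discount = 0.99, 0.0, 1.0
obs, _ = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()              # a_t sampled from the policy
    obs, reward, terminated, truncated, _ = env.step(action)
    ret += discount * reward                        # accumulate gamma^t * r_t
    discount *= gamma
    done = terminated or truncated                  # d_t = 1 ends the episode
print(f"discounted return: {ret:.2f}")
```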