# Model-Based Imitation Learning for Urban Driving

Anthony Hu1,2, Gianluca Corrado1, Nicolas Griffiths1, Zak Murez1, Corina Gurau1, Hudson Yeo1, Alex Kendall1, Roberto Cipolla2, Jamie Shotton1

1Wayve, UK. 2University of Cambridge, UK. research@wayve.ai

An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expert demonstrations. Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment. MILE improves upon the prior state-of-the-art by 31% in driving score on the CARLA simulator when deployed in a completely new town and new weather conditions. Our model can predict diverse and plausible states and actions that can be interpretably decoded to bird's-eye view semantic segmentation. Further, we demonstrate that it can execute complex driving manoeuvres from plans entirely predicted in imagination. Our approach is the first camera-only method that models static scene, dynamic scene, and ego-behaviour in an urban driving environment. The code and model weights are available at https://github.com/wayveai/mile.

## 1 Introduction

From an early age we start building internal representations of the world through observation and interaction [1]. Our ability to estimate scene geometry and dynamics is paramount to generating complex and adaptable movements. This accumulated knowledge of the world, part of what we often refer to as common sense, allows us to navigate effectively in unfamiliar situations [2]. In this work, we present MILE, a Model-based Imitation LEarning approach to jointly learn a model of the world and a driving policy.
We demonstrate the effectiveness of our approach in the autonomous driving domain, operating on complex visual inputs labelled only with expert actions and semantic segmentation. Unlike prior work on world models [3, 4, 5], our method does not assume access to a ground-truth reward, nor does it need any online interaction with the environment. Further, previous environments in OpenAI Gym [3], MuJoCo [4], and Atari [5] were characterised by simplified visual inputs as small as 64×64 images. In contrast, MILE operates on high-resolution camera observations of urban driving scenes.

Driving inherently requires a geometric understanding of the environment, and MILE exploits 3D geometry as an important inductive bias by first lifting image features to 3D and pooling them into a bird's-eye view (BeV) representation. The evolution of the world is modelled by a latent dynamics model that infers compact latent states from observations and expert actions. The learned latent state is the input to a driving policy that outputs vehicle control, and can additionally be decoded to BeV segmentation for visualisation and as a supervision signal.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Our method also relaxes the assumption made in some recent work [6, 7] that neither the agent nor its actions influence the environment. This assumption rarely holds in urban driving, and therefore MILE is action-conditioned, allowing us to model how other agents respond to ego-actions. We show that our model can predict plausible and diverse futures from latent states and actions over long time horizons. It can even predict entire driving plans in imagination to successfully execute complex driving manoeuvres, such as negotiating a roundabout, or swerving to avoid a motorcyclist (see videos in the supplementary material). We showcase the performance of our model on the driving simulator CARLA [8], and demonstrate a new state-of-the-art.
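The lift-and-splat step above can be illustrated with a minimal NumPy sketch: each pixel's feature is spread along its viewing ray according to a learned categorical depth distribution, and the resulting 3D feature points are sum-pooled into BeV cells. This is a simplified single-camera illustration with hypothetical function and argument names, not the paper's implementation.

```python
import numpy as np

def lift_splat(feats, depth_logits, pixel_rays, depth_bins, bev_shape, cell_size):
    """Lift image features to 3D and pool ('splat') them into a BeV grid.

    feats:        (H, W, C) image features
    depth_logits: (H, W, D) unnormalised depth scores per pixel
    pixel_rays:   (H, W, 3) ray direction per pixel in the camera frame
                  (convention here: x right, z forward)
    depth_bins:   (D,) candidate depths in metres
    bev_shape:    (rows, cols) of the BeV grid
    cell_size:    metres per BeV cell
    """
    H, W, C = feats.shape
    # Categorical depth distribution per pixel (softmax over depth bins)
    p = np.exp(depth_logits - depth_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    # Outer product: each pixel contributes its feature at every depth,
    # weighted by the depth probability -> (H, W, D, C) frustum of features
    frustum = p[..., None] * feats[:, :, None, :]
    # 3D point for each (pixel, depth) pair -> (H, W, D, 3)
    points = pixel_rays[:, :, None, :] * depth_bins[None, None, :, None]
    # Rasterise ground-plane (x, z) coordinates into BeV cells, sum-pooling
    # all feature points that fall into the same cell
    bev = np.zeros((*bev_shape, C))
    ix = np.floor(points[..., 0] / cell_size + bev_shape[1] / 2).astype(int)
    iz = np.floor(points[..., 2] / cell_size).astype(int)
    valid = (ix >= 0) & (ix < bev_shape[1]) & (iz >= 0) & (iz < bev_shape[0])
    np.add.at(bev, (iz[valid], ix[valid]), frustum[valid])
    return bev
```

With a uniform depth distribution, a pixel's feature is split equally across its depth bins before pooling; a trained network instead concentrates the mass at the correct depth.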
MILE achieves a 31% improvement in driving score with respect to previous methods [9, 10] when tested in a new town and new weather conditions. Finally, because we model time with a recurrent neural network, during inference we can maintain a single state that summarises all past observations, and efficiently update that state when a new observation becomes available. We demonstrate that this design decision has important benefits for deployment in terms of latency, with negligible impact on driving performance.

To summarise, the main contributions of this paper are:

- We introduce a novel model-based imitation learning architecture that scales to the visual complexity of autonomous driving in urban environments by leveraging 3D geometry as an inductive bias.
- Our method is trained solely on an offline corpus of expert driving data, and does not require any interaction with an online environment or access to a reward, offering strong potential for real-world application.
- Our camera-only model sets a new state-of-the-art on the CARLA simulator, surpassing other approaches, including those requiring LiDAR inputs.
- Our model predicts a distribution of diverse and plausible future states and actions. We demonstrate that it can execute complex driving manoeuvres from plans entirely predicted in imagination.

## 2 Related Work

Our work is at the intersection of imitation learning, 3D scene representation, and world modelling.

**Imitation learning.** Although the first end-to-end method for autonomous driving was envisioned more than 30 years ago [11], early autonomous driving approaches were dominated by modular frameworks, where each module solves a specific task [12, 13, 14]. Recent years have seen the development of several end-to-end self-driving systems that show strong potential to improve driving performance by predicting driving commands from high-dimensional observations alone.
Conditional imitation learning has proven to be a successful method for learning end-to-end driving policies that can be deployed in simulation [15] and in real-world urban driving scenarios [16]. Nevertheless, the difficulties of learning end-to-end policies from high-dimensional visual observations and expert trajectories alone have been highlighted [17]. Several works have attempted to overcome these difficulties by moving past pure imitation learning. DAgger [18] proposes iterative dataset aggregation to collect data from trajectories that are likely to be experienced by the policy during deployment. NEAT [19] additionally supervises the model with BeV semantic segmentation. ChauffeurNet [20] exposes the learner to synthesised perturbations of the expert data in order to produce more robust driving policies. Learning from All Vehicles (LAV) [10] boosts sample efficiency by learning behaviours not only from the ego-vehicle, but from all vehicles in the scene. Roach [9] presents an agent trained with supervision from a reinforcement learning coach that was trained on-policy with access to privileged information.

**3D scene representation.** Successful planning for autonomous driving requires the ability to understand and reason about the 3D scene, which can be challenging from monocular cameras. One common solution is to condense the information from multiple cameras into a single bird's-eye view representation of the scene. This can be achieved by lifting each image into 3D (by learning a depth distribution of features) and then splatting all frustums onto a common rasterised BeV grid [21, 22, 23]. An alternative approach relies on transformers to learn a direct mapping from image to bird's-eye view [24, 25, 26], without explicitly modelling depth.

**World models.** Model-based methods have mostly been explored in a reinforcement learning setting, where they have been shown to be extremely successful [3, 27, 5].
These methods assume access to a reward and online interaction with the environment, although progress has been made on fully offline reinforcement learning [28, 29]. Model-based imitation learning has emerged as an alternative to reinforcement learning in robotic manipulation [30] and OpenAI Gym [31]. Even though these methods do not require access to a reward, they still require online interaction with the environment to achieve good performance. Learning the latent dynamics of a world model from image observations was first introduced in video prediction [32, 33, 34]. Most similar to our approach, [4, 5] additionally modelled the reward function and optimised a policy inside their world model. In contrast to prior work, our method does not assume access to a reward function, and directly learns a policy from an offline dataset. Additionally, previous methods operate on simple visual inputs, mostly of size 64×64. In contrast, MILE learns the latent dynamics of complex urban driving scenes from high-resolution 600×960 input observations, which is important to ensure that small details such as traffic lights can be perceived reliably.

**Trajectory forecasting.** The goal of trajectory forecasting is to estimate the future trajectories of dynamic agents using past physical states (e.g. position, velocity) and scene context (e.g. an offline HD map) [35, 36, 37, 38]. World models build a latent representation of the environment that explains the observations from the sensory inputs of the ego-agent (e.g. camera images), conditioned on its actions. While trajectory forecasting methods only model the dynamic scene, world models jointly reason about the static and dynamic scene. The future trajectories of moving agents are implicitly encoded in the learned latent representation of the world model, and could be explicitly decoded given access to future trajectory labels.
[35, 37, 38] forecast the future trajectories of moving agents, but do not control the ego-agent: they focus on the prediction problem rather than on learning expert behaviour from demonstrations. [39] inferred future trajectories of the ego-agent from expert demonstrations, conditioned on a specified goal in order to perform new tasks. [36] extended this work to jointly model the future trajectories of moving agents as well as of the ego-agent. Our proposed model jointly models the motion of other dynamic agents, the behaviour of the ego-agent, and the static scene. Contrary to prior work, we do not assume access to ground-truth physical states (position, velocity) or to an offline HD map for scene context. Our approach is the first camera-only method that models the static scene, the dynamic scene, and ego-behaviour in an urban driving environment.

## 3 MILE: Model-based Imitation LEarning

In this section, we present MILE: our method that learns to jointly control an autonomous vehicle and model the world and its dynamics. An overview of the architecture is presented in Figure 1, and the full description of the network can be found in Appendix C. We begin by defining the generative model (Section 3.1), and then derive the inference model (Section 3.2). Section 3.3 and Section 3.4 describe the neural networks that parametrise the inference and generative models respectively. Finally, in Section 3.5 we show how our model can predict future states and actions to drive in imagination.

### 3.1 Probabilistic Generative Model

Let $o_{1:T}$ be a sequence of $T$ video frames with associated expert actions $a_{1:T}$ and ground-truth BeV semantic segmentation labels $y_{1:T}$. We model their evolution by introducing latent variables $s_{1:T}$ that govern the temporal dynamics. The initial distribution is parameterised as $s_1 \sim \mathcal{N}(0, I)$, and we additionally introduce a variable $h_1 \sim \delta(0)$ that serves as a deterministic history.
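A minimal sketch of rolling these latent dynamics forward: the deterministic history is updated from the previous history and state, and the next stochastic state is sampled from a Gaussian whose parameters depend on the new history and the action. Untrained random linear maps stand in for the learned networks here; all names and sizes are hypothetical, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 8, 4, 2  # hypothetical sizes: history, stochastic state, action

# Stand-ins for the learned networks f_theta and (mu_theta, sigma_theta):
# simple random linear maps in place of the recurrent cell and MLPs.
W_h = rng.normal(size=(H, H + S)) * 0.1
W_mu = rng.normal(size=(S, H + A)) * 0.1
W_sig = rng.normal(size=(S, H + A)) * 0.1

def f_theta(h, s):
    # Deterministic update of the history from (h_t, s_t)
    return np.tanh(W_h @ np.concatenate([h, s]))

def prior(h, a):
    # Gaussian parameters of the next stochastic state given (h, a)
    x = np.concatenate([h, a])
    mu = W_mu @ x
    sigma = np.log1p(np.exp(W_sig @ x))  # softplus keeps sigma positive
    return mu, sigma

# Initial distribution: h_1 is a deterministic zero, s_1 ~ N(0, I)
h = np.zeros(H)
s = rng.normal(size=S)

# Roll the latent dynamics forward under a sequence of actions
for a in rng.normal(size=(5, A)):
    h = f_theta(h, s)
    mu, sigma = prior(h, a)
    s = mu + sigma * rng.normal(size=S)  # reparameterised Gaussian sample

print(h.shape, s.shape)  # → (8,) (4,)
```

Sampling `s` via mean plus scaled noise is the standard reparameterisation that keeps the transition differentiable for training.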
The transition consists of (i) a deterministic update $h_{t+1} = f_\theta(h_t, s_t)$ that depends on the past history $h_t$ and past state $s_t$, followed by (ii) a stochastic update $s_{t+1} \sim \mathcal{N}(\mu_\theta(h_{t+1}, a_t), \sigma_\theta(h_{t+1}, a_t) I)$, where the stochastic state is parameterised as a normal distribution with diagonal covariance. We model these transitions with neural networks: $f_\theta$ is a gated recurrent cell, and $(\mu_\theta, \sigma_\theta)$ are multi-layer perceptrons. The full probabilistic model is given by Equation (1).

Figure 1: Architecture of MILE. (i) The goal is to infer the latent dynamics $(h_{1:T}, s_{1:T})$ that generated the observations $o_{1:T}$, the expert actions $a_{1:T}$, and the bird's-eye view labels $y_{1:T}$. The latent dynamics contains a deterministic history $h_t$ and a stochastic state $s_t$. (ii) The inference model, with parameters $\phi$, estimates the posterior distribution of the stochastic state $q(s_t \mid o_{\leq t}, a$