# Diffusion Models Are Real-Time Game Engines

Published as a conference paper at ICLR 2025

Dani Valevski (Google Research) daniv@google.com · Yaniv Leviathan (Google Research) leviathan@google.com · Moab Arar (Tel Aviv University) moab.arar@gmail.com · Shlomi Fruchter (Google DeepMind) shlomif@google.com

ABSTRACT

We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen learns the gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next-frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of autoregressive generation. GameNGen is trained in two phases: (1) an RL agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable autoregressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.

Figure 1: A human player playing DOOM on GameNGen at 20 FPS. See the supplementary material for multi-minute real-time video recordings of people interactively playing with GameNGen.

1 INTRODUCTION

Computer games are manually crafted software systems centered around the following game loop: (1) update the game state based on user input, and (2) render the game state to screen pixels. This game loop, running at high frame rates, creates the illusion of an interactive virtual world for the player.
Such game loops are classically run on standard computers, and while there have been many impressive attempts at running games on bespoke hardware (e.g. the iconic game DOOM has been run on kitchen appliances, a treadmill, a camera, and within the game of Minecraft¹), in all of these cases the hardware is still emulating the manually written game software as-is. Furthermore, while vastly different game engines exist, the game state updates and rendering logic in all are composed of a set of manual rules, programmed or configured by hand.

In recent years, generative models have made significant progress in producing images and videos conditioned on multi-modal inputs, such as text or images. At the forefront of this wave, diffusion models became the de-facto standard in media (i.e. non-language) generation, with works like DALL-E (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2022) and Sora (Brooks et al., 2024). At a glance, simulating the interactive worlds of video games may seem similar to video generation. However, interactive world simulation is more than just very fast video generation. The requirement to condition on a stream of input actions that only becomes available during generation breaks some assumptions of existing diffusion model architectures. Notably, it requires generating frames autoregressively, which tends to be unstable and leads to sampling divergence (see Section 3.2.1). Several important works (Ha & Schmidhuber, 2018; Kim et al., 2020; Bruce et al., 2024) (see Section 6) simulate interactive video games with neural models. Nevertheless, most of these approaches are limited with respect to the complexity of the simulated games, simulation speed, stability over long time periods, or visual quality (see Figure 2).

Equal contribution. Work done while at Google Research.

¹ See https://www.reddit.com/r/itrunsdoom/
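The autoregressive generation the text refers to can be sketched as a simple rollout loop; `model`, the context length, and the frame representation below are illustrative assumptions, not GameNGen's actual interface:

```python
def autoregressive_rollout(model, init_frames, actions, context_len=4):
    """Simulate a game one frame at a time.

    Each prediction is conditioned on a sliding window of previously
    *generated* frames plus the incoming action stream, so small errors
    can re-enter the context and compound over long trajectories (the
    instability noted above). `model`, `context_len`, and the frame
    format are hypothetical placeholders.
    """
    frames = list(init_frames)
    for action in actions:
        next_frame = model(frames[-context_len:], action)
        frames.append(next_frame)  # generated frame feeds back as context
    return frames[len(init_frames):]
```

Because the model's own outputs become future inputs, any per-step bias accumulates, which is why plain teacher-forced training tends to diverge at inference time.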
It is therefore natural to ask: Can a neural model running in real-time simulate a complex game at high quality? In this work we demonstrate that the answer is yes. Specifically, we show that a complex video game, the iconic game DOOM, can be run on a neural network, an augmented version of the open Stable Diffusion v1.4 (Rombach et al., 2022), in real-time, while achieving a visual quality comparable to that of the original game. While not an exact simulation (see limitations in Section 7), the neural model is able to perform complex game state updates, such as tallying health and ammo, attacking enemies, damaging objects, opening doors, and, more generally, persisting the game state over long trajectories.

Our key contribution is a demonstration that a complex video game (DOOM) can be simulated by a neural network in real time with high quality on a single TPU. We provide concrete architecture and technical insights on how to (1) adapt a text-to-image diffusion model to a stable autoregressive setup via noise augmentation, (2) achieve visual quality comparable to the original via fine-tuning the latent decoder, and (3) collect training data from an existing game at scale via an RL agent. More broadly, demonstrating that real-time simulation of complex games on existing hardware is possible addresses one important question on the path towards a new paradigm for game engines: one where games are automatically generated, much like how images and videos have been generated by neural models in recent years. While bigger questions remain, such as how to use human input to create entirely new games instead of simulating existing ones, we are nevertheless excited for the possibilities of this new paradigm (see Section 7 for further discussion).

Figure 2: GameNGen compared to prior simulations of DOOM: World Models (Ha & Schmidhuber, 2018) and GameGAN (Kim et al., 2020). Note that the prior models are trained on different data.
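Contribution (1), noise augmentation of the conditioning frames, can be sketched roughly as below. The uniform noise-level sampling and the `max_noise` value are illustrative assumptions, not the paper's exact schedule:

```python
import numpy as np

def noise_augment_context(context_frames, rng, max_noise=0.7):
    """Corrupt the conditioning frames with Gaussian noise at a random
    level during training, and expose that level to the model, so it
    learns to correct the kind of drift its own autoregressive samples
    introduce at inference time.

    `max_noise` and the uniform sampling of the level are hypothetical
    choices for illustration.
    """
    level = rng.uniform(0.0, max_noise)
    noisy = context_frames + level * rng.standard_normal(context_frames.shape)
    return noisy, level  # the level is also fed to the model as conditioning
```

At inference time, a small fixed noise level can then be applied to the generated context frames, putting them inside the distribution the model was trained to denoise.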
2 INTERACTIVE WORLD SIMULATION

An Interactive Environment E consists of a space of latent states S, a space of observations of the latent space O, a partial projection function V : S → O, a set of actions A, and a transition probability function p(s | a, s′) such that s, s′ ∈ S, a ∈ A.

Figure 3: Method overview (see Section 3). Previous frames are encoded to latents, noise-augmented, and concatenated; the model then predicts the next frame.

For example, in the case of the game DOOM, S is the program's dynamic memory contents, O is the rendered screen pixels, V is the game's rendering logic, A is the set of key presses, and p is the program's logic given the player's input (including any potential non-determinism).

Given an input interactive environment E, and an initial state s₀ ∈ S, an Interactive World Simulation is a simulation distribution function q(on|o