# ChatCam: Empowering Camera Control through Conversational AI

Xinhang Liu¹, Yu-Wing Tai², Chi-Keung Tang¹
¹HKUST ²Dartmouth College

Cinematographers adeptly capture the essence of the world, crafting compelling visual narratives through intricate camera movements. Witnessing the strides made by large language models in perceiving and interacting with the 3D world, this study explores their capability to control cameras with human language guidance. We introduce ChatCam, a system that navigates camera movements through conversations with users, mimicking a professional cinematographer's workflow. To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation. We also develop an Anchor Determinator to ensure precise camera trajectory placement. ChatCam understands user requests and employs our proposed tools to generate trajectories, which can be used to render high-quality video footage on radiance field representations. Our experiments, including comparisons to state-of-the-art approaches and user studies, demonstrate our approach's ability to interpret and execute complex instructions for camera operation, showing promising applications in real-world production settings. We will release the codebase upon paper acceptance.

## 1 Introduction

Cinematographers skillfully capture the essence of the 3D world by maneuvering their cameras, creating an array of compelling visual narratives [8]. Achieving aesthetically pleasing results requires not only a deep understanding of scene elements and their interplay but also meticulous execution of techniques. Recent progress in large language models (LLMs) [1] has marked a significant milestone in AI development, demonstrating their capability to understand and act within the 3D world [29, 30, 87]. Witnessing this evolution, our work explores the feasibility of empowering camera control through conversational AI, thus enhancing the video production process across diverse domains such as documentary filmmaking, live event broadcasting, and virtual reality experiences.

Although the community has devoted considerable effort to controlling the trajectories of objects and cameras in video generation approaches for practical usage [4, 82, 75, 28], or to predicting similar sequences through autoregressive decoding processes [35, 64], generating camera trajectories conditioned on text has yet to be explored. This task involves multiple elements such as language, images, and 3D assets, and, beyond mere accuracy, it necessitates visually pleasing rendered videos as the ultimate goal.

We propose ChatCam, a system that allows users to control camera operations through natural language interaction. As illustrated in Figure 1, leveraging an LLM agent to orchestrate camera operations, our method assists users in generating desired camera trajectories, which can be used to render videos on radiance field representations such as NeRF [52] or 3DGS [36].

At the core of our approach, we introduce CineGPT, a GPT-based autoregressive model that integrates language understanding with camera trajectory generation. We train this model using a paired text-trajectory dataset to equip it with the ability for text-conditioned trajectory generation.
Figure 1: Empowering camera control through conversational AI. Our proposed ChatCam assists users in generating desired camera trajectories through natural language interactions. The generated trajectories can be used to render videos on radiance field representations such as NeRF [52] or 3DGS [36]. The example dialogue shown in the figure: the user first asks to capture the Opera House with the Harbour Bridge in the background, then to ascend for a top-down aerial shot, pan the camera to the opposite side, and zoom in while decreasing the focal length to encompass the city skyline; ChatCam confirms the request ("Stand by, and action!") and produces the generated camera trajectory and rendered video.

We also propose an Anchor Determinator, a module that identifies relevant objects within the 3D scene to serve as anchors, ensuring correct trajectory placement based on user specifications.

Our LLM agent parses compositional natural language queries into semantic concepts. With these parsed sub-queries as inputs, the agent then calls our proposed CineGPT and Anchor Determinator. It composes the final trajectory with the outputs from these tools, which can ultimately be used to render a video that fulfills the user's request.

With comprehensive evaluations and comparisons to other state-of-the-art methods, our method exhibits a pronounced ability to interpret and execute complex instructions for camera operation. Our user studies further demonstrate its promising application prospects in actual production settings. In summary, this paper's contributions are as follows:

- We introduce ChatCam, a system that, for the first time, enables users to operate cameras through natural language interactions. It simplifies sophisticated camera movements and reduces technical hurdles for creators.
- We develop CineGPT for text-conditioned camera trajectory generation and an Anchor Determinator for precise camera trajectory placement. Our LLM agent understands users' requests and leverages our proposed tools to complete the task.
- Extensive experiments demonstrate the effectiveness of our method, showing how AI can effectively collaborate with humans on complex tasks involving multiple elements such as language, images, 3D assets, and camera trajectories.

## 2 Related Work

Multimodal Language Models. Large-scale language models (LLMs) [9, 19, 17, 1, 70], enabled by extensive datasets and model size, have demonstrated surprising emergent abilities. The emergence of multimodal models [43, 42, 33] is captivating, as they can process text alongside other modalities such as images [25], audio [27], and videos [78]. Some unified models can perceive inputs and generate outputs in various combinations of text, images, videos, and audio [51, 68, 77, 86]. LLMs hold the potential to act as agents [76, 80, 62], allowing them to be driven by goals, reason about their objectives, devise plans, utilize tools, and interact with and gather feedback from the environment. Our proposed method involves multiple modalities, including language, images, 3D fields, and camera trajectories, and utilizes LLMs as agents to assist users in operating cameras.

Radiance Field Representations. Utilizing continuous 3D fields modeled by MLPs and volumetric rendering, Neural Radiance Fields (NeRFs) [52] achieved a breakthrough in novel view synthesis.
Figure 2: Overview of the ChatCam pipeline. Given a camera operation instruction, ChatCam reasons about the user's request and devises a plan to generate a trajectory using our proposed CineGPT and Anchor Determinator. The agent then utilizes the outputs from these tools to compose the complete trajectory and render a video. The example shown in the figure uses the instruction "From outside the window, enter the bedroom and explore its layout following an S-shaped path." Observation: the user is requesting a camera trajectory that starts from a location outside the window and then covers the bedroom's layout following an S-shaped path. Reasoning: find a starting point outside the window, move through the window into the room, and capture the room along an S-shaped path, using a smooth panning speed to better showcase the room's layout. Plan: (1) call the Anchor Determinator with "capture the bedroom from outside the window"; (2) call CineGPT with "pan straight forward, smooth panning speed"; (3) call CineGPT with "an S-shaped path, smooth panning speed"; (4) connect the output trajectories from steps 2 and 3, ensuring they pass through the anchor point determined in step 1; (5) ask the user for further clarification if needed; (6) render a video using the generated camera trajectory.

Subsequent research has emerged to improve NeRFs and broaden their applications [69], including enhancing rendering quality [5, 6, 7, 47], modeling dynamic scenes [84, 55, 58, 71, 74, 10, 22, 2], improving computational efficiency [83, 23], and facilitating 3D scene editing [45, 84, 73, 34, 46, 40]. Replacing the deep MLPs with a feature voxel grid network has demonstrated enhancements in both training and inference speed [13, 66, 24, 54]. More recently, 3D Gaussian Splatting [36] has further advanced visual quality and rendering efficiency. Compared to traditional 3D representations, radiance field representations offer superior photorealistic rendering quality; therefore, this study focuses on camera manipulation upon mainstream radiance field representations such as NeRF or 3DGS.

3D Scene Understanding. Early methods for 3D semantic understanding [32, 67, 79, 15] primarily focused on the closed-set segmentation of point clouds or voxels. NeRF's capability to integrate information from multiple viewpoints has spurred its application in 3D semantic segmentation [88, 20, 46, 53, 65, 26, 60, 31, 48, 49, 21]. Among these, [40, 37, 12] combine image embeddings from effective 2D image feature extractors [41, 11, 59, 39] to achieve language-guided object localization, segmentation, and editing. [21] proposes semantic anisotropic Gaussians to simultaneously estimate geometry, appearance, and semantics in a single feed-forward pass. Another line of research integrates 3D with language models for tasks such as 3D question answering [3], localization [14, 57, 81], and captioning [16]. Additionally, [29, 30, 87] propose 3D foundation models to handle various perception, reasoning, and action tasks in 3D environments. However, the AI-assisted operation of cameras within 3D scenes remains an unexplored area.

Trajectory Control and Prediction. Controlling the trajectories of objects and cameras is crucial to advance current video generation approaches for practical usage. TC4D [4] incorporates trajectory control for 4D scene generation with multiple dynamic objects.
Direct-a-Video [82], MotionCtrl [75], and CameraCtrl [28] manage camera pose during video generation; however, they are either limited to basic types or necessitate fine-tuning of the video diffusion model. Moreover, these approaches require user-provided trajectories, whereas we, for the first time, generate camera trajectories conditioned on text.

Figure 3: (a) CineGPT. We quantize camera trajectories into sequences of tokens and adopt a GPT-based architecture to generate the tokens autoregressively. Learning trajectory and language jointly, CineGPT is capable of text-conditioned trajectory generation. (b) Anchor Determination. Given a prompt describing the image rendered from an anchor point, the anchor selector chooses the best-matching input image. An anchor refinement procedure further fine-tunes the anchor position. Example prompts shown in the figure: "Gently push the camera forward while keeping it rolling" as a trajectory prompt, and "A close-up image of the LEGO Technic bulldozer with a kitchen backdrop" as an anchor prompt.

## 3 Method

Figure 2 provides an overview of our method's pipeline. ChatCam analyzes the user's camera operation instruction and devises a plan to generate a trajectory using our proposed CineGPT and Anchor Determinator. Finally, an AI agent utilizes the outputs from these tools to compose the complete trajectory.

### 3.1 Text-Conditioned Trajectory Generation

To enable text-conditioned trajectory generation, we collect a text-trajectory dataset and introduce CineGPT, a GPT-based autoregressive model integrating language and camera trajectories. As illustrated in Figure 3 (a), our method quantizes camera trajectories into a sequence of trajectory tokens using a trajectory tokenizer. Subsequently, a multi-modal transformer decoder is employed to convert input tokens into output tokens. Once trained, our model generates token sequences from user-provided text prompts; these sequences are then de-quantized to reconstruct the camera trajectory.

Camera Trajectory Parameterization. For each frame, our camera parameters include a rotation $R \in \mathbb{R}^{3 \times 3}$, a translation $t \in \mathbb{R}^{3}$, and intrinsic parameters $K \in \mathbb{R}^{3 \times 3}$. We further convert the rotation matrix $R$ into the $S^2 \times S^2$ space [89] to facilitate computational efficiency and simplify the optimization process. The total $M$-frame camera trajectory is formulated as:

$$c_{1:M} = \{c_i\}_{i=1}^{M} = \{(R_i, t_i, K_i)\}_{i=1}^{M}. \tag{1}$$

To additionally model the velocity of camera movement, we introduce a global parameter representing the total duration. Consequently, the instantaneous velocity of each frame can be approximated by the relative translation and rotation to the previous frame over unit time.

Text-Trajectory Dataset. Given the scarcity of readily available data on camera operations, we manually constructed approximately 1000 camera trajectories using Blender [18]. These trajectories encompass a diverse range of movements, including various combinations of translations, rotations, focal lengths, and velocities. Each trajectory is accompanied by a human language description detailing the corresponding movements. The dataset spans various scenarios, capturing both simple pan-tilt-zoom motions and more complex trajectories mimicking real-world camera work.
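To make the parameterization of Eq. (1) and the paired text-trajectory samples concrete, the sketch below shows one plausible in-memory layout: per-frame rotation, translation, and intrinsics, a global duration, and a paired language description. The class names, the $S^2 \times S^2$ helper, and the example values are illustrative assumptions rather than the paper's actual data format.

```python
# Illustrative sketch only: a plausible container for the trajectory
# parameterization of Eq. (1) plus a paired text description.
# All names and values here are hypothetical, not the paper's data format.
from dataclasses import dataclass

import numpy as np


@dataclass
class CameraFrame:
    R: np.ndarray  # (3, 3) rotation matrix
    t: np.ndarray  # (3,)   translation
    K: np.ndarray  # (3, 3) intrinsics (focal length, principal point)


@dataclass
class CameraTrajectory:
    frames: list        # c_{1:M} = {(R_i, t_i, K_i)}_{i=1}^{M}
    duration: float     # global parameter: total duration in seconds
    text: str           # paired natural-language description

    def speeds(self) -> np.ndarray:
        """Approximate per-frame translational speed: displacement relative
        to the previous frame divided by the per-frame time step."""
        dt = self.duration / max(len(self.frames) - 1, 1)
        ts = np.stack([f.t for f in self.frames])             # (M, 3)
        return np.linalg.norm(np.diff(ts, axis=0), axis=1) / dt


def rotation_to_s2xs2(R: np.ndarray) -> np.ndarray:
    """One common way to map a rotation matrix onto S^2 x S^2: keep its first
    two (orthonormal) columns. The paper cites [89] for its exact mapping."""
    return R[:, :2].T.reshape(6)


# Example: a 60-frame, 2-second dolly-forward trajectory described in text.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
frames = [CameraFrame(R=np.eye(3), t=np.array([0.0, 0.0, 0.05 * i]), K=K)
          for i in range(60)]
traj = CameraTrajectory(frames=frames, duration=2.0,
                        text="Gently push the camera forward.")
print(traj.speeds().mean())           # roughly constant forward speed
print(rotation_to_s2xs2(np.eye(3)))   # [1. 0. 0. 0. 1. 0.]
```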
Trajectory Tokenizer. We leverage a trajectory tokenizer based on the Vector Quantized Variational Autoencoder (VQ-VAE) architecture [72] to represent camera trajectories as discrete tokens. The tokenizer consists of an encoder $E$ and a decoder $D$. Given an $M$-frame camera trajectory $c_{1:M} = \{c_i\}_{i=1}^{M}$, the encoder $E$ encodes it into $L$ trajectory tokens $z_{1:L} = \{z_i\}_{i=1}^{L}$, where $L = M/l$ and $l$ is the temporal downsampling rate. The decoder $D$ then decodes $z_{1:L}$ back into the trajectory $\hat{c}_{1:M} = \{\hat{c}_i\}_{i=1}^{M}$.

Specifically, the encoder $E$ first encodes the frame-wise camera parameters $c_{1:M}$ into latent vectors $\hat{z}_{1:L} = E(c_{1:M})$ by performing 1D convolutions along the time dimension. We then transform $\hat{z}_{1:L}$ into a collection of codebook entries $z_{1:L}$ through discrete quantization. The learnable codebook $Z = \{z_i\}_{i=1}^{K}$ consists of $K$ latent embedding vectors, each with dimension $d$. The quantization process $Q(\cdot)$ replaces each row vector with its nearest codebook entry, as follows:

$$z_i = Q(\hat{z}_i) = \arg\min_{z_k \in Z} \lVert \hat{z}_i - z_k \rVert_2^2, \tag{2}$$

where $\lVert \cdot \rVert_2$ denotes the Euclidean distance. After quantization, the decoder projects $z_{1:L}$ back to the trajectory space as the reconstructed trajectory $\hat{c}_{1:M} = D(z_{1:L})$. In addition to the reconstruction loss, we adopt an embedding loss and a commitment loss similar to those proposed in [85] to train our trajectory tokenizer. With a trained trajectory tokenizer, a camera trajectory $c_{1:M}$ can be mapped to a sequence of trajectory tokens $z_{1:L}$, facilitating the joint representation of camera trajectories and natural language for text-conditioned trajectory generation.

Cross-Modal Transformer. We utilize a cross-modal transformer decoder to generate output tokens from input tokens, which may consist of text tokens, trajectory tokens, or a combination of both. These output tokens are subsequently converted into the target space. To train our decoder-only transformer, we denote the source tokens as $X_s = \{x_s^i\}_{i=1}^{N_s}$ and the target tokens as $X_t = \{x_t^i\}_{i=1}^{N_t}$. We feed the source tokens into the transformer to predict the probability distribution of the next target token at each step, $p_\theta(x_t \mid x_s) = \prod_i p_\theta(x_t^i \mid x_t^{<i}, x_s)$.
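As a concrete illustration of this factorization, the following is a minimal autoregressive decoding loop for a decoder-only model: the text tokens serve as the prefix $x_s$, and target tokens are sampled one at a time from $p_\theta(x_t^i \mid x_t^{<i}, x_s)$. The `model` interface, end token, and sampling temperature are stand-ins, not CineGPT's actual implementation.

```python
# Minimal sketch of autoregressive decoding with a decoder-only transformer:
# condition on source (text) tokens, then sample target (trajectory) tokens
# one step at a time. `model` is a stand-in for any module that maps a token
# sequence to next-token logits; this is not CineGPT's actual interface.
import torch


@torch.no_grad()
def generate(model, src_tokens: torch.Tensor, max_len: int,
             end_token: int, temperature: float = 1.0) -> torch.Tensor:
    seq = src_tokens.clone()                    # (1, N_s) prefix x_s
    out = []
    for _ in range(max_len):
        logits = model(seq)[:, -1, :]           # logits for p(x_t^i | x_t^{<i}, x_s)
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)   # sample one token, (1, 1)
        if nxt.item() == end_token:             # stop at the end-of-sequence token
            break
        out.append(nxt)
        seq = torch.cat([seq, nxt], dim=1)      # append and condition on it next step
    return torch.cat(out, dim=1) if out else src_tokens.new_zeros((1, 0))
```

The trajectory tokens consumed and produced by such a loop come from the quantization step of Eq. (2). The snippet below sketches that nearest-neighbor lookup against a learnable codebook, with a straight-through estimator so gradients reach the encoder; the tensor shapes and codebook size are assumptions, and the reconstruction, embedding, and commitment losses used to train the tokenizer are omitted.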
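```python
# Sketch of the VQ quantization step of Eq. (2): replace each encoder output
# with its nearest codebook entry. Shapes and codebook size are assumptions,
# not the paper's configuration; training losses are omitted.
import torch


def quantize(z_hat: torch.Tensor, codebook: torch.Tensor):
    """z_hat: (L, d) encoder outputs; codebook: (K, d) learnable entries.
    Returns quantized latents (L, d) and the chosen indices (L,), i.e. the
    discrete trajectory tokens z_{1:L}."""
    dists = torch.cdist(z_hat, codebook, p=2) ** 2   # squared distances, (L, K)
    indices = dists.argmin(dim=1)                    # argmin_k ||z_hat_i - z_k||^2
    z_q = codebook[indices]                          # nearest entries, (L, d)
    z_q = z_hat + (z_q - z_hat).detach()             # straight-through estimator
    return z_q, indices


# Example: 16 latents of dimension 8 quantized against a 512-entry codebook.
z_hat, codebook = torch.randn(16, 8), torch.randn(512, 8)
z_q, tokens = quantize(z_hat, codebook)
print(tokens.shape)   # torch.Size([16])
```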