# DreamHuman: Animatable 3D Avatars from Text

Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Fieraru, Cristian Sminchisescu
Google Research
{kolotouros,alldieck,andreiz,egbazavan,fieraru,sminchisescu}@google.com

We present DreamHuman, a method to generate realistic animatable 3D human avatar models solely from textual descriptions. Recent text-to-3D methods have made considerable strides in generation, but are still lacking in important aspects. Control and often spatial resolution remain limited, existing methods produce fixed rather than animated 3D human models, and anthropometric consistency for complex structures like people remains a challenge. DreamHuman connects large text-to-image synthesis models, neural radiance fields, and statistical human body models in a novel modeling and optimization framework. This makes it possible to generate dynamic 3D human avatars with high-quality textures and learned, instance-specific, surface deformations. We demonstrate that our method is capable of generating a wide variety of animatable, realistic 3D human models from text. Our 3D models have diverse appearance, clothing, skin tones, and body shapes, and significantly outperform both generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity.

## 1 Introduction

The remarkable progress in Large Language Models [46, 8] has sparked considerable interest in generating a wide variety of media modalities from text. There has been significant progress in text-to-image [49, 50, 52, 67, 10, 34], text-to-speech [37, 41], text-to-music [2, 19], and text-to-3D [22, 43] generation, to name a few. Key to the success of some of the popular generative image methods conditioned on text has been diffusion models [52, 50, 55]. Recent works have shown that these text-to-image models can be combined with differentiable neural 3D scene representations [5] and optimized to generate realistic 3D models solely from textual descriptions [22, 43].

Controllable generation of photorealistic 3D human models has long been a focus of the research community. This is also the goal of our work; we want to generate realistic, animatable 3D humans given only textual descriptions. Our method goes beyond static text-to-3D generation methods, because we learn a dynamic, articulated 3D model that can be placed in different poses, without additional training or fine-tuning. We capitalize on the recent progress in text-to-3D generation [43], neural radiance fields [31, 5], and human body modelling [64, 3] to produce 3D human models with realistic appearance and high-quality geometry. We achieve this without using any supervised text-to-3D data, or any image conditioning. We generate photorealistic and animatable 3D human models by relying only on text, as can be seen in Figure 1 and Figure 2.

As impressive as general-purpose 3D generation methods [43] are, we argue they are suboptimal for 3D human synthesis, due to limited control over generation, which often results in undesirable visual artifacts such as unrealistic body proportions, missing limbs, or the wrong number of fingers. Such inconsistencies can be partially attributed to known problems of text-to-image networks, but become even more apparent when considering the arguably more difficult problem of 3D generation.
Besides enabling animation capabilities, we show that geometric and kinematic human priors can resolve anthropometric consistency problems in an effective way. Our proposed method, coined DreamHuman, can become a powerful tool for professional artists and 3D animators and can automate complex parts of the design process, with potentially transformative effects in industries such as gaming, special effects, as well as film and content creation.

Figure 1: Examples of 3D models synthesized and posed by our method ("A man with dreadlocks", "A blonde woman wearing yoga pants"). DreamHuman can produce an animatable 3D avatar given only a textual description of a human's appearance. At test time, our avatar can be reposed based on a set of 3D poses or a motion, without additional refinement.

Our main contributions are:

- We present a novel method to generate 3D human models that can be placed in a variety of poses, with realistic clothing deformations, given only a single textual description, and by training without any supervised text-to-3D data.
- Our models incorporate 3D human body priors that are necessary for regularizing the generation and re-posing of the resulting avatar, by using multiple losses to ensure the quality of human structure, appearance, and deformation.
- We improve the quality of the generation by means of semantic zooming with refining prompts to add detail in perceptually important body regions, such as the face and the hands.

## 2 Related Work

There is considerable work related to diffusion models [58] and their applications to image generation [17, 35, 11, 52, 50, 55, 54] or image editing [24, 53, 16, 32]. Our focus is on text-to-3D [22, 43, 47] and more specifically on realistic 3D human generation conditioned on text prompts. In the following subsections we revisit some of the work most relevant to our goals.

Text-to-3D generation. CLIP-Forge [56] combines CLIP [45] text-image embeddings with a learned 3D shape prior to generate 3D objects without any labeled text-to-3D pairs. DreamFields [22] optimizes a NeRF model given a text prompt using guidance from CLIP [45]. CLIP-Mesh [25] also uses CLIP, but substitutes NeRF with meshes as its underlying 3D representation. DreamFusion [43] builds on top of DreamFields and uses supervision from a diffusion-based text-to-image model [54]. Latent-NeRF [30] uses a similar strategy to DreamFusion, but optimizes a NeRF that operates in the space of a Latent Diffusion model [52]. TEXTure [51] takes as input both a text prompt and a target mesh and optimizes the texture map to agree with the input prompt. Magic3D [28] uses a two-stage strategy that combines Neural Radiance Fields with meshes for high-resolution 3D generation. Unlike our method, all mentioned works produce a static 3D scene given a text prompt. When queried with human-related prompts, results often exhibit artifacts like missing face details, unrealistic geometric proportions, partial body generation, or an incorrect number of body parts like legs or fingers. We generate accurate and anthropomorphically consistent results by incorporating 3D human priors in the loop.

Text-to-3D human generation. Several methods [40, 60, 4, 26, 15] learn to generate 3D human motions from text by leveraging text-to-MoCap datasets. MotionCLIP [59] learns to generate 3D human motions without using any paired text-to-motion data by leveraging CLIP as supervision.
However, all these methods output 3D human motions in the form of 3D coordinates or human body model parameters [29] and do not have the capability to generate photorealistic results. AvatarCLIP [18] learns a NeRF in the rest pose of SMPL [29], which is then converted back to a mesh using marching cubes. However, the reposing procedure depends on fixed skinning weights that limit the overall realism of the animation. In contrast, our method learns per-instance, pose-specific geometric deformations that result in significantly more realistic clothing appearance. The concurrent work AvatarCraft [23] produces avatars with geometry closely tied to the underlying SMPL [29] model and thus cannot model loose-fitting clothing. Another concurrent work, DreamAvatar [9], has to be retrained for every new pose, which makes reposing computationally prohibitive.

Figure 2: 3D human avatars generated using our method given text prompts. We render each example in a random pose from two viewpoints, along with corresponding surface normal maps. Prompts: "A Buddhist monk", "An Asian man wearing a navy suit", "A woman wearing a short jean skirt and a cropped top", "A woman wearing a wedding dress", "A man with blond hair wearing a brown leather jacket", "A young man wearing a turtleneck", "A pregnant person of color", "A thin Marathon runner", "A man wearing a Christmas sweater", "A senior Black person wearing a polo shirt", "A Karate master wearing a black belt", "A bodybuilder wearing a tanktop", "A plus-size model wearing pyjamas", "A chef dressed in white", "A Black female surgeon", "An Indian bride in a traditional dress", "A woman wearing ski clothes", "A Black woman dressed in gym clothes", "A Spanish flamenco dancer", "A person in a diving suit", "A Black person in a military uniform", "A man wearing a bomber jacket", "A track and field athlete", "A person dressed at the Venice Carnival", "A man wearing a hoodie".

Figure 3: Overview of DreamHuman. Given a text prompt, such as "a woman wearing a dress", we generate a realistic, animatable 3D avatar whose appearance and body shape match the textual description. A key component in our pipeline is a deformable and pose-conditioned NeRF model learned and constrained using imGHUM [3], an implicit statistical 3D human pose and shape model. At each training step, we synthesize our avatar based on randomly sampled poses and render it from random viewpoints. The optimisation of the avatar structure is guided by the Score Distillation Sampling loss [43] powered by a text-to-image generation model [54]. We rely on imGHUM [3] to add pose control and inject anthropomorphic priors in the avatar optimisation process. We also use several other normal, mask and orientation-based losses in order to ensure coherent synthesis. NeRF, body shape, and spherical harmonics illumination parameters (in red) are optimised.

Deformable Neural Radiance Fields. Several methods attempt to learn deformable NeRFs to model dynamic content [38, 44, 61, 39, 57]. There has also been work on representing articulated human bodies [65, 36, 63, 69, 20, 66, 27]. The method most closely related to ours is H-NeRF [65], which combines implicit human body models with NeRFs. Compared to H-NeRF, our method uses a simpler approach where we enforce consistency directly in 3D and not via renderings of two different density fields. Also, while H-NeRF uses videos for supervision, our only input is text, and we are not constrained by the poses and viewpoints present in the video. Thus, our method can generalize better across a variety of poses and camera viewpoints.
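The per-iteration optimization summarized in the Figure 3 caption above can be sketched roughly as follows. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation: `AvatarNeRF`, `render`, and `sds_grad` are placeholder stand-ins for the deformable pose-conditioned NeRF, the differentiable volume renderer, and the Score Distillation Sampling gradient that would come from a frozen text-to-image diffusion model conditioned on the prompt.

```python
import torch

class AvatarNeRF(torch.nn.Module):
    """Toy stand-in for the deformable, pose-conditioned NeRF (color + density)."""
    def __init__(self, pose_dim=72, hidden=256):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 + pose_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4))  # outputs (r, g, b, density)

    def forward(self, points, pose):
        feats = torch.cat([points, pose.expand(points.shape[0], -1)], dim=-1)
        out = self.mlp(feats)
        return torch.sigmoid(out[:, :3]), torch.nn.functional.softplus(out[:, 3])

def render(nerf, pose, shape, sh, cam, n_samples=4096):
    """Toy stand-in for rendering the avatar from viewpoint `cam`; a real
    implementation would cast rays and volume-render them (see Sec. 3.1)."""
    # Body shape crudely scales the sampled region; SH coefficients fake a light.
    points = torch.randn(n_samples, 3) * (1.0 + 0.1 * shape.pow(2).sum()) + cam
    color, density = nerf(points, pose)
    return color * (1.0 + sh.mean()), density

def sds_grad(image):
    """Placeholder for the Score Distillation Sampling gradient that, in the real
    system, is obtained from a frozen text-to-image diffusion model and the prompt."""
    return torch.randn_like(image)

nerf = AvatarNeRF()
body_shape = torch.zeros(10, requires_grad=True)   # e.g. a statistical body shape code
sh_illum = torch.zeros(9, requires_grad=True)      # spherical harmonics illumination
opt = torch.optim.Adam(list(nerf.parameters()) + [body_shape, sh_illum], lr=1e-3)

for step in range(100):
    pose = torch.randn(72) * 0.1          # randomly sampled body pose
    cam = torch.randn(3)                  # randomly sampled viewpoint
    image, density = render(nerf, pose, body_shape, sh_illum, cam)

    # SDS update: nudge the rendering toward the text prompt (gradient is detached).
    loss_sds = (sds_grad(image).detach() * image).sum()
    # Stand-in for the normal-, mask- and orientation-based regularizers.
    loss_reg = 1e-3 * density.mean()
    (loss_sds + loss_reg).backward()
    opt.step()
    opt.zero_grad()
```

In the actual method, the regularizer placeholder would be replaced by the normal-, mask-, and orientation-based losses mentioned in the Figure 3 caption, and imGHUM [3] would constrain the density field and supply the pose-dependent deformation.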
## 3 Methodology

### 3.1 Architecture

We rely on Neural Radiance Fields (NeRF) [31] to represent our 3D scene as a continuous function of its spatial coordinates [33]. We use a multi-layer perceptron (MLP) that maps each spatial point $x \in \mathbb{R}^3$ to a tuple $(c, \tau)$ of RGB color and density values. To render a scene using NeRF, one casts rays from the camera center through the image pixels and then computes the expected color $\hat{C}$ along each ray. In practice, this is done by sampling points $x_i$ along the ray and approximating the rendering integral [5] as

$$\hat{C} = \sum_i w_i c_i, \qquad w_i = \alpha_i \prod_{j<i} (1 - \alpha_j), \qquad \alpha_i = 1 - \exp(-\tau_i \delta_i),$$

where $c_i$ and $\tau_i$ are the color and density at sample $x_i$, and $\delta_i$ is the distance between adjacent samples.
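For concreteness, a minimal sketch of this quadrature for a single ray is shown below, assuming the per-sample colors, densities, and spacings have already been produced by the MLP and a ray sampler; the function and variable names are illustrative only, not the authors' code.

```python
import torch

def expected_color(colors, densities, deltas):
    """colors: (N, 3) per-sample RGB, densities: (N,), deltas: (N,) distances
    between consecutive samples. Returns the approximated pixel color
    C = sum_i w_i c_i for a single ray."""
    alphas = 1.0 - torch.exp(-densities * deltas)      # alpha_i = 1 - exp(-tau_i * delta_i)
    # Exclusive cumulative product gives the transmittance T_i = prod_{j<i} (1 - alpha_j).
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas])[:-1], dim=0)
    weights = alphas * trans                           # w_i = alpha_i * T_i
    return (weights[:, None] * colors).sum(dim=0)

# Example usage with random samples along one ray.
n = 64
c = torch.rand(n, 3)                  # per-sample RGB predicted by the NeRF MLP
tau = torch.rand(n) * 5.0             # per-sample densities
delta = torch.full((n,), 1.0 / n)     # uniform spacing between samples
print(expected_color(c, tau, delta))  # approximated expected color along the ray
```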