# HeadSculpt: Crafting 3D Head Avatars with Text

Xiao Han¹,⁴  Yukang Cao²  Kai Han²  Xiatian Zhu¹,⁵  Jiankang Deng³  Yi-Zhe Song¹,⁴  Tao Xiang¹,⁴  Kwan-Yee K. Wong²

¹University of Surrey  ²The University of Hong Kong  ³Imperial College London  ⁴iFlyTek-Surrey Joint Research Centre on AI  ⁵Surrey Institute for People-Centred AI

Webpage: https://brandonhan.uk/HeadSculpt. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Recently, text-guided 3D generative methods have made remarkable advancements in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two aspects: (1) They rely mostly on a pre-trained text-to-image diffusion model whilst missing the necessary 3D awareness and head priors. This makes them prone to inconsistency and geometric distortions in the generated avatars. (2) They fall short in fine-grained editing. This is primarily due to limitations inherited from the pre-trained 2D image diffusion models, which become more pronounced when it comes to 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back-view appearance of heads, enabling 3D-consistent head avatar generation. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique. This enables identity preservation while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods.

1 Introduction

Modeling 3D head avatars underpins a wide range of emerging applications (e.g., digital telepresence, game character creation, and AR/VR). Historically, the creation of intricate and detailed 3D head avatars demanded considerable time and expertise in art and engineering. With the advent of deep learning, existing works [87, 28, 33, 72, 8, 38, 15] have shown promising results on the reconstruction of 3D human heads from monocular images or videos. However, these methods remain restricted to the head appearances contained in their training data, which is often limited in size, and thus fail to generalize to new appearances beyond it. This constraint calls for more flexible and generalizable methods for 3D head modeling. Recently, vision-language models (e.g., CLIP [55]) and diffusion models (e.g., Stable Diffusion [69, 61, 59]) have attracted increasing interest. This progress has led to the emergence of text-to-3D generative models [34, 62, 44, 27] which create 3D content in a self-supervised manner. Notably, DreamFusion [54] introduces a score distillation sampling (SDS) strategy that leverages a pre-trained image diffusion model to compute a noise-level loss from the textual description, unlocking the potential to optimize differentiable 3D scenes (e.g., neural radiance fields [45], tetrahedral meshes [66], textures [58, 9], or point clouds [50]) with a 2D diffusion prior only.
Figure 1: Examples of generation and editing results obtained using the proposed HeadSculpt. It enables the creation and fine-grained editing of high-quality head avatars, featuring intricate geometry and texture, for any type of head avatar using simple descriptions or instructions. Symbols indicate the following prompt prefixes: "a head of [text]" and "a DSLR portrait of [text]". The captions in gray are the prompt suffixes while the blue ones are the editing instructions.

Subsequent research efforts [43, 6, 65, 79, 42, 75, 40, 56, 76] improve and extend DreamFusion from various perspectives (e.g., higher resolution [39] and better geometry [10]). Considering the flexibility and versatility of natural language, one might think that these SDS-based text-to-3D generative methods would be sufficient for generating diverse 3D avatars. However, existing methods have two major drawbacks (see Fig. 6): (1) Inconsistency and geometric distortions: the 2D diffusion models used in these methods lack 3D awareness, particularly regarding camera pose; without any remedy, existing text-to-3D methods inherit this limitation, leading to the multi-face Janus problem in the generated head avatars. (2) Fine-grained editing limitations: although previous methods propose to edit 3D models by naively fine-tuning trained models with modified prompts [54, 39], we find that this approach is prone to biased outcomes, such as identity loss or inadequate editing. This problem arises from two causes: (a) inherent bias in prompt-based editing in image diffusion models, and (b) inconsistent gradient back-propagation across iterations when using SDS computed from a vanilla image diffusion model.

In this paper, we introduce a new head-avatar-focused text-to-3D method, dubbed HeadSculpt, that supports high-fidelity generation and fine-grained editing. Our method comprises two novel components: (1) Prior-driven score distillation: we first arm the pre-trained image diffusion model with 3D awareness by integrating a landmark-based ControlNet [84]. Specifically, we adopt the parametric 3D head model FLAME [38] as a prior to obtain a 2D landmark map [41, 31], which serves as an additional condition for the diffusion model, ensuring the consistency of generated head avatars across different views. Further, to remedy the front-view bias in the pre-trained diffusion model, we utilize an improved view-dependent prompt through textual inversion [17], learning a specialized token to emphasize back views of heads and capture their unique visual details. (2) Identity-aware editing score distillation (IESD): to address the challenges of fine-grained editing for head avatars, we introduce a novel method called IESD.
It blends two scores, one for editing and the other for identity preservation, both predicted by a Control Net-based implementation of Instruct Pix2Pix [5]. This approach maintains a controlled editing direction that respects both the original identity and the editing instructions. To further improve the fidelity of our method, we integrate these two novel components into a coarse-to-fine pipeline [39], utilizing Ne RF [48] as the low-resolution coarse model and DMTET [66] as the high-resolution fine model. As demonstrated in Fig. 1, our method can generate high-fidelity human-like and non-human-like head avatars while enabling fine-grained editing, including local changes, shape/texture modifications, and style transfers. 2 Related work Text-to-2D generation. In recent years, groundbreaking vision-language technologies such as CLIP [55] and diffusion models [25, 13, 59, 68] have led to significant advancements in text-to-2D content generation [61, 57, 1, 69, 70]. Trained on extensive 2D multimodal datasets [63, 64], they are empowered with the capability to dream from the prompt. Follow-up works endeavor to efficiently control the generated results [84, 85, 47], extend the diffusion model to video sequence [67, 3], accomplish image or video editing [23, 32, 81, 5, 77, 14, 22], enhance the performance for personalized subjects [60, 17], etc. Although significant progress has been made in generating 2D content from text, carefully crafting the prompt is crucial, and obtaining the desired outcome often requires multiple attempts. The inherent randomness remains a challenge, especially for editing tasks. Text-to-3D generation. Advancements in text-to-2D generation have paved the way for text-to-3D techniques. Early efforts [82, 27, 44, 62, 34, 29, 11] propose to optimize the 3D neural radiance field (Ne RF) or vertex-based meshes by employing the CLIP language model. However, these models encounter difficulties in generating expressive 3D content, primarily because of the limitations of CLIP in comprehending natural language. Fortunately, the development of image diffusion models [69, 1] has led to the emergence of Dream Fusion [54]. It proposes Score Distillation Sampling (SDS) based on a pre-trained 2D diffusion prior [61], showcasing promising generation results. Subsequent works [37] have endeavored to improve Dream Fusion from various aspects: Magic3D [39] proposes a coarse-to-fine pipeline for high-resolution generations; Latent-Ne RF [43] includes shape guidance for more robust generation on the latent space [59]; Dream Avatar [6] leverages SMPL [4] to generate 3D human full-body avatars under controllable shapes and poses; Guide3D [7] explores the usage of multi-view generated images to create 3D human avatars; Fantasia3D [10] disentangles the geometry and texture training with DMTET [66] and PBR texture [49] as their 3D representation; 3DFuse [65] integrates depth control and semantic code sampling to stabilize the generation process. Despite notable progress, current text-to-3D generative models still face challenges in producing view-consistent 3D content, especially for intricate head avatars. This is primarily due to the absence of 3D awareness in text-to-2D diffusion models. Additionally, to the best of our knowledge, there is currently no approach that specifically focuses on editing the generated 3D content, especially addressing the intricate fine-grained editing needs of head avatars. 3D head modeling and creation. 
Statistical mesh-based models, such as FLAME [38, 15], enable the reconstruction of 3D head models from images. However, they struggle to capture fine details like hair and wrinkles. To overcome this issue, recent approaches [8, 71, 72, 51] employ Generative Adversarial Networks (GANs) [46, 20, 30] to train 3D-aware networks on 2D head datasets and produce 3D-consistent images through latent code manipulation. Furthermore, neural implicit methods [87, 16, 28, 88] introduce implicit and subject-oriented head models based on neural rendering fields [45, 48, 2]. Recently, text-to-3D generative methods have gained traction, generating high-quality 3D head avatars from natural language using vision-language models [55, 69]. Typically, T2P [85] predicts bone-driven parameters of head avatars via a game engine under CLIP guidance [55]. Rodin [80] proposes a roll-out diffusion network to perform 3D-aware diffusion. DreamFace [83] employs a selection strategy in the CLIP embedding space to generate coarse geometry and uses SDS [54] to optimize the UV texture. Despite producing promising results, all these methods require a large amount of data for supervised training and struggle to generalize well to non-human-like avatars. In contrast, our approach relies solely on pre-trained text-to-2D models, generalizes well to out-of-domain avatars, and is capable of performing fine-grained editing tasks.

Figure 2: Overall architecture of HeadSculpt. We craft high-resolution 3D head avatars in a coarse-to-fine manner. (a) Coarse stage: we optimize neural field representations for the coarse model. (b) Fine stage: we refine or edit the model using the extracted 3D mesh (DMTET) and apply identity-aware editing score distillation if editing is the target. (c) The core of our pipeline is the prior-driven score distillation (PSD), which incorporates landmark control, enhanced view-dependent prompts, and an InstructPix2Pix branch (used for editing only).

3 Methodology

HeadSculpt is a 3D-aware text-to-3D approach that utilizes a pre-trained text-to-2D Stable Diffusion model [69, 59] to generate high-resolution head avatars and perform fine-grained editing tasks. As illustrated in Fig. 2, the generation pipeline has two stages: coarse generation via a neural radiance field (NeRF) [48] and refinement/editing using a tetrahedral mesh (DMTET) [66]. Next, we will first introduce the preliminaries that form the basis of our method in Sec. 3.1. We will then discuss the key components of our approach in Sec. 3.2 and Sec. 3.3, including (1) the prior-driven score distillation process via landmark-based ControlNet [84] and textual inversion [17], and (2) identity-aware editing score distillation accomplished in the fine stage using the ControlNet-based InstructPix2Pix [5].

3.1 Preliminaries

Score distillation sampling.
Recently, DreamFusion [54] proposed score distillation sampling (SDS) to self-optimize a text-consistent neural radiance field (NeRF) based on a pre-trained text-to-2D diffusion model [61]. Due to the unavailability of the Imagen model [61] used by DreamFusion, we employ the latent diffusion model in [59] instead. Specifically, given a latent feature z encoded from an image x, SDS introduces random noise ϵ to z to create a noisy latent variable z_t and then uses a pre-trained denoising function ϵ_ϕ(z_t; y, t) to predict the added noise. The SDS loss is defined as the difference between the predicted and added noise, and its gradient is given by

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, g(\theta)) = \mathbb{E}_{t,\,\epsilon \sim \mathcal{N}(0,1)}\!\left[ w(t)\,\big(\epsilon_\phi(z_t; y, t) - \epsilon\big)\,\frac{\partial z}{\partial \theta} \right], \tag{1}$$

where y is the text embedding and w(t) weights the loss from noise level t. With the expressive text-to-2D diffusion model and the self-supervised SDS loss, we can back-propagate the gradients to optimize an implicit 3D scene g(θ), eliminating the need for an expensive 3D dataset.

3D scene optimization. HeadSculpt explores the potential of two different 3D differentiable representations as the optimization basis for crafting 3D head avatars. Specifically, we employ NeRF [48] in the coarse stage due to its greater flexibility in geometry deformation, while utilizing DMTET [66] in the fine stage for efficient high-resolution optimization. (1) 3D prior-based NeRF. DreamAvatar [6] recently proposed a density-residual setup to enhance the robustness and controllability of the generated 3D NeRF. Given a point x inside the 3D volume, we can derive its density and color value based on a prior-based density field σ̄:

$$F(\mathbf{x}, \bar{\sigma}) = F_\theta(\gamma(\mathbf{x})) + (\bar{\sigma}(\mathbf{x}), \mathbf{0}) \mapsto (\sigma, \mathbf{c}), \tag{2}$$

where γ(·) denotes a hash-grid frequency encoder [48], and σ and c are the density and RGB color, respectively. We can derive σ̄ from the signed distance d(x) of a given 3D shape prior (e.g., a canonical FLAME model [38] by default in our implementation):

$$\bar{\sigma}(\mathbf{x}) = \max\!\big(0,\ \operatorname{softplus}^{-1}(\tau(\mathbf{x}))\big), \quad \tau(\mathbf{x}) = \tfrac{1}{a}\,\operatorname{sigmoid}\!\big(-d(\mathbf{x})/a\big), \quad \text{where } a = 0.005. \tag{3}$$

To obtain a 2D RGB image from the implicit volume defined above, we employ a volume rendering technique that involves casting a ray r from the 2D pixel location into the 3D scene, sampling points µ_i along the ray, and calculating their density and color values using F in Eq. (2):

$$\mathbf{C}(\mathbf{r}) = \sum_i W_i\,\mathbf{c}_i, \quad W_i = \alpha_i \prod_{j<i} (1 - \alpha_j). \tag{4}$$

3.2 Prior-driven score distillation

Enhanced view-dependent prompt. To remedy the front-view bias of the pre-trained diffusion model, we learn a specialized token to replace the plain text "back view" in order to emphasize the rear appearance of heads. This is based on the assumption that a pre-trained Stable Diffusion does have the ability to imagine the back view of a head; it has seen some during training. The main problem is that a generic text embedding of "back view" is inadequate in telling the model what appearance it entails. A better embedding for "back view" is thus required. To this end, we first randomly download 34 images of the back view of human heads, without revealing any personal identities, to construct a tiny dataset D, and then we optimize the embedding v of the special token to better fit the collected images, similar to textual inversion [17]:

$$v^* = \arg\min_{v}\ \mathbb{E}_{t,\,\epsilon \sim \mathcal{N}(0,1),\,z \sim \mathcal{D}}\!\left[ \big\|\epsilon - \epsilon_\phi(z_t; v, t)\big\|_2^2 \right], \tag{6}$$

which is achieved by employing the same training scheme as the original diffusion model, while keeping ϵ_ϕ fixed. This constitutes a reconstruction task, which we anticipate will encourage the learned embedding to capture the fine visual details of the back views of human heads. Notably, as we do not update the weights of ϵ_ϕ, it stays compatible with the landmark-based ControlNet.
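For concreteness, the following is a minimal PyTorch-style sketch of a single SDS update as in Eq. (1); it is an illustrative simplification rather than the paper's released code. `eps_model`, `cond`, `uncond`, and `alphas_cumprod` are assumed stand-ins for the frozen latent-diffusion denoiser ϵ_ϕ, the (text and, in our pipeline, landmark) conditioning, the unconditional embedding, and the scheduler's cumulative α values, respectively.

```python
import torch

def sds_step(eps_model, z, cond, uncond, alphas_cumprod, optimizer, guidance_scale=100.0):
    """One score-distillation update on a rendered latent z (Eq. (1)).

    z must be differentiable w.r.t. the 3D scene parameters held by `optimizer`
    (i.e., z = encode(render(theta, camera))). `eps_model(z_t, t, c)` is an assumed
    interface standing in for the frozen Stable Diffusion UNet.
    """
    num_steps = alphas_cumprod.shape[0]
    # Sample a noise level inside the t range (0.02, 0.98) listed in Tab. 2.
    t = torch.randint(int(0.02 * num_steps), int(0.98 * num_steps), (1,), device=z.device)
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)

    eps = torch.randn_like(z)                                    # the added noise
    z_t = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * eps  # forward diffusion

    with torch.no_grad():                                        # the diffusion prior stays frozen
        eps_c = eps_model(z_t, t, cond)                          # conditioned prediction
        eps_u = eps_model(z_t, t, uncond)                        # unconditional prediction
        eps_pred = eps_u + guidance_scale * (eps_c - eps_u)      # classifier-free guidance (scale 100, Tab. 2)

    w = alpha_bar * (1.0 - alpha_bar)                            # w(t) = alpha_t (1 - alpha_t), as in Tab. 2
    grad = w * (eps_pred - eps)

    # Surrogate loss: its gradient w.r.t. theta equals w(t) (eps_pred - eps) dz/dtheta.
    loss = (grad.detach() * z).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```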
3.3 Identity-aware editing score distillation

After generating avatars, editing them to fulfill particular requirements poses an additional challenge. Previous works [54, 39] have shown promising editing results by fine-tuning a trained scene model with a new target prompt. However, when applied to head avatars, these methods often suffer from identity loss or inadequate appearance modifications (see Fig. 10). This problem stems from an inherent constraint of the SDS loss, whereby the 3D models often sacrifice prominent features to preserve view consistency. Substituting Stable Diffusion with InstructPix2Pix [5, 21] might seem like a simple solution, but it also faces difficulties in maintaining facial identity during editing based only on instructions, as it lacks a well-defined anchor point. To this end, we propose identity-aware editing score distillation (IESD) to regulate the editing direction by blending two predicted scores, i.e., one for the editing instruction and another for the original description. Rather than using the original InstructPix2Pix [5], we employ a ControlNet-based InstructPix2Pix I [84] trained on the same dataset, ensuring compatibility with our landmark-based ControlNet C and the learned token. Formally, given an initial textual prompt y describing the avatar to be edited and an editing instruction ŷ, we first input them separately into the same diffusion model equipped with the two ControlNets, I and C. This allows us to obtain two predicted noises, which are then combined using a predefined hyper-parameter ω_e, in a manner similar to classifier-free diffusion guidance (CFG) [26]:

$$\nabla_\theta \mathcal{L}_{\mathrm{IESD}}(\phi, g(\theta)) = \mathbb{E}_{t,\,\epsilon \sim \mathcal{N}(0,1),\,\pi}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi\big(z_t; y, \hat{y}, t, C(P_\pi), I(M_\pi)\big) - \epsilon\big)\,\frac{\partial z}{\partial \theta} \right], \tag{7}$$

$$\hat{\epsilon}_\phi\big(z_t; y, \hat{y}, t, C(P_\pi), I(M_\pi)\big) = \omega_e\,\epsilon_\phi\big(z_t; \hat{y}, t, C(P_\pi), I(M_\pi)\big) + (1 - \omega_e)\,\epsilon_\phi\big(z_t; y, t, C(P_\pi), I(M_\pi)\big),$$

where P_π and M_π represent the 2D landmark maps and the reference images rendered in the coarse stage, both obtained under the sampled camera pose π. The parameter ω_e governs a trade-off between the original appearance and the desired editing, and defaults to 0.6 in our experiments.

4 Experiments

We will now assess the efficacy of our HeadSculpt across different scenarios, while also conducting a comparative analysis against state-of-the-art text-to-3D generation pipelines.

Implementation details. HeadSculpt builds upon Stable-DreamFusion [73] and Hugging Face Diffusers [78, 53]. We utilize version 1.5 of Stable Diffusion [69] and version 1.1 of ControlNet [84, 12] in our implementation. In the coarse stage, we optimize our 3D model at 64×64 grid resolution, while using 512×512 grid resolution for the fine stage (refinement or editing). Typically, each text prompt requires approximately 7,000 iterations for the coarse stage and 5,000 iterations for the fine stage. It takes around 1 hour for each stage on a single Tesla V100 GPU with a default batch size of 4. We use the Adam [35] optimizer with a fixed learning rate of 0.001. Additional implementation details can be found in the supplementary material.

Figure 3: Generation results with various shapes. The first row shows three randomly sampled FLAME models, while the second row presents our generated results (incl. normals) using these FLAME models as initialization. All results are under the same text prompt ("a DSLR portrait of Saul Goodman").

Figure 4: More specific editing results, including expression edits (sad, surprised, disgusted) and instructions such as "make him bald", "give him a beard", and "give him a sunglass". Instruction prefix for expressions: "make his expression as [text]".

Baseline methods for generation evaluation.
We compare the generation results with five baselines: DreamFusion [73], Latent-NeRF [43], 3DFuse [65] (an improved version of SJC [79]), Fantasia3D [10], and DreamFace [83]. We do not directly compare with DreamAvatar [6] as it involves deformation fields for full-body-related tasks.

Baseline methods for editing evaluation. We assess IESD's efficacy for fine-grained 3D head avatar editing by comparing it with various alternatives, since no dedicated method exists for this: (B1) one-step optimization on the coarse stage without initialization; (B2) initialized from the coarse stage, followed by optimization of another coarse stage with an altered description; (B3) initialized from the coarse stage, followed by optimization of a new fine stage with an altered description; (B4) initialized from the coarse stage, followed by optimization of a new fine stage with an instruction based on the vanilla InstructPix2Pix [5]; (B5) ours without the edit scale (i.e., ω_e = 1). Notably, B2 represents the editing method proposed in DreamFusion [54], while B3 performs similarly to Magic3D [39], which employs a three-stage editing process (i.e., Coarse + Coarse + Fine).

4.1 Qualitative evaluations

Head avatar generation with various prompts. In Fig. 1, we show a diverse array of 3D head avatars generated by our HeadSculpt, consistently demonstrating high-quality geometry and texture across various viewpoints. Our method's versatility is emphasized by its ability to create an assortment of avatars, including humans (both celebrities and ordinary individuals) as well as non-human characters like superheroes, comic/game characters, paintings, and more.

Head avatar generation with different shapes. HeadSculpt leverages shape-guided NeRF initialization and landmark-guided diffusion priors. This allows controlling the geometry by varying the FLAME shape used for initialization. To demonstrate this adjustability, Fig. 3 presents examples generated from diverse FLAME shapes. The results fit closely to the shape guidance, highlighting HeadSculpt's capacity for geometric variation when provided different initial shapes.

Head avatar editing with various instructions. As illustrated in Fig. 1 and Fig. 4, HeadSculpt's adaptability is also showcased through its ability to perform fine-grained editing, such as local changes (e.g., adding accessories, changing hairstyles, or altering expressions), shape and texture modifications, and style transfers.

Head avatar editing with different edit scales. In Fig. 5, we demonstrate the effectiveness of IESD with different ω_e values, highlighting its ability to control the editing influence on the reference identity.

Figure 5: Impact of the edit scale ω_e in IESD (reference avatar "Saul Goodman", instruction "turn him into a clown", ω_e ∈ {0.2, 0.4, 0.6, 0.8, 1.0}). It balances the preservation of the initial appearance and the extent of the desired editing, making the editing process more controllable and flexible.

Figure 6: Comparison with existing text-to-3D methods (DreamFusion* [73], Latent-NeRF [43], 3DFuse [65], Fantasia3D* [74], DreamFace [83], and HeadSculpt (ours)) on the prompts "a DSLR portrait of Salvador Dalí", "a head of Stormtrooper", "a DSLR portrait of a female soldier, wearing a helmet", and "a DSLR portrait of a young black lady with short hair, wearing a headphone". Unlike other methods that struggle or fail to generate reasonable results, our approach consistently achieves high-quality geometry and texture, yielding superior results. *Non-official implementation. Generated from the online website demo.
Comparison with existing methods on generation results. We provide qualitative comparisons with existing methods in Fig. 6. We employ the same FLAME model for Latent-NeRF [43] to compute their sketch-guided loss and for Fantasia3D [74] as the initial geometry. The following observations can be made: (1) all baselines tend to be more unstable during training than ours, often resulting in diverged training processes; (2) Latent-NeRF occasionally produces plausible results due to its use of the shape prior, but its textures are inferior to ours since optimization occurs solely in the latent space; (3) despite 3DFuse's depth control to mitigate the Janus problem, it still struggles to generate 3D-consistent head avatars; (4) while Fantasia3D can generate a mesh-based 3D avatar, its geometry is heavily distorted, as its disentangled geometry optimization might be insufficient for highly detailed head avatars; (5) although DreamFace generates realistic human face textures, it falls short in generating (i) complete heads, (ii) intricate geometry, (iii) non-human-like appearance, and (iv) composite accessories. In comparison, our method consistently yields superior results in both geometry and texture with much better consistency for the given prompt. More comparisons can be found in the supplementary material.

Figure 7: User study, rating textual consistency, texture quality, and geometry quality for DreamFusion, Latent-NeRF, 3DFuse, Fantasia3D, and HeadSculpt (ours). Numbers are averaged over 42 responses.

Table 1: Objective evaluation with CLIP-based metrics. All numbers are calculated with CLIP-L/14.

| Generation | DreamFusion | Latent-NeRF | 3DFuse | Fantasia3D | Ours |
|---|---|---|---|---|---|
| CLIP-R [52] | 95.83 | 87.50 | 70.83 | 62.50 | 100.00 |
| CLIP-S [24] | 26.06 | 26.30 | 23.41 | 23.26 | 29.52 |

| Editing | B3 | B4 | B5 | Ours |
|---|---|---|---|---|
| CLIP-DS [18] | 16.62 | 8.76 | 14.03 | 16.84 |

Figure 8: Analysis of prior-driven score distillation, comparing HeadSculpt (ours), ours w/o landmark control, and ours w/o textual inversion on prompts such as "a head of Woody in the Toy Story", "a head of Walter White, wearing a bowler hat", "a head of Bumblebee in Transformers", and "a head of Mario in Mario Franchise".

Figure 9: Failure cases (e.g., "Sun Wukong", "Freddy Krueger", "Japanese Geisha", and "remove his nose").

4.2 Quantitative evaluations

User studies. We conducted user studies comparing with four baselines [73, 74, 65, 43]. 42 volunteers ranked them from 1 (worst) to 5 (best) individually based on three dimensions: (1) consistency with the text, (2) texture quality, and (3) geometry quality. The results, shown in Fig. 7, indicate that our method achieved the highest rank in all three aspects by large margins.

CLIP-based metrics. Following DreamFusion [54], (1) we calculate the CLIP R-Precision (CLIP-R) [52] and CLIP-Score (CLIP-S) [24] metrics, which evaluate the correlation between the generated images and the input texts, for all methods using 30 text prompts. As indicated in Tab. 1, our approach significantly outperforms the competing methods according to both metrics. This outcome provides additional evidence for the subjective superiority observed in the user studies and qualitative results. (2) We employ the CLIP Directional Similarity (CLIP-DS) [5, 18] to evaluate the editing performance. This metric measures the alignment between changes in text captions and corresponding image modifications.
Specifically, we encode a pair of images (the original and edited 3D models rendered from a specific viewpoint) along with a pair of text prompts describing the original and edited subjects, e.g., "a DSLR portrait of Saul Goodman" and "a DSLR portrait of Saul Goodman dressed like a clown". We compare our approach against B3, B4, and B5 by evaluating 10 edited examples. The results, presented in Tab. 1, highlight the superiority of our editing framework according to this metric, indicating improved editing fidelity and identity preservation compared to the alternatives.

4.3 Further analysis

Effectiveness of prior-driven score distillation. In Fig. 8, we conduct ablation studies to examine the impact of the proposed landmark control and textual inversion priors in our method. We demonstrate this on the coarse stage because the refinement and editing results heavily depend on this stage. The findings show that landmark control is essential for generating spatially consistent head avatars. Without it, the optimized 3D avatar faces challenges in maintaining consistent facial views, particularly for non-human-like characters. Moreover, textual inversion is shown to be another vital component in mitigating the Janus problem, specifically for the back view, as landmarks cannot exert control on the rear view. Overall, the combination of both components enables HeadSculpt to produce view-consistent avatars with high-quality geometry.

Figure 10: Analysis of identity-aware editing score distillation, comparing B1 (one-stage), B2 (Coarse + Coarse), B3 (Coarse + Fine), B4 (naive IP2P), B5 (ours w/o ω_e), and HeadSculpt (ours) on two cases: modified description "a DSLR portrait of +[older] Saul Goodman" / instruction "make him older", and modified description "a DSLR portrait skull of Vincent van Gogh" / instruction "turn his face into a skull".

Effectiveness of IESD. In Fig. 10, we present two common biased editing scenarios produced by the baseline methods: insufficient editing and loss of identity. With Stable Diffusion, specific terms like "Saul Goodman" and "skull" exert a more substantial influence on the text embeddings compared to other terms, such as "older" and "Vincent van Gogh". B1, B2, and B3, all based on vanilla Stable Diffusion, inherit such bias in their generated 3D avatars. Although B4 does not show such bias, it faces two other issues: (1) the Janus problem reemerges due to the incompatibility between the vanilla InstructPix2Pix and the proposed prior-driven score distillation; (2) it struggles to maintain facial identity during editing based solely on instructions, lacking a well-defined anchor point. In contrast, B5 employs the ControlNet-based InstructPix2Pix [84] with the proposed prior-driven score distillation, resulting in more view-consistent editing. Additionally, our IESD further uses the proposed edit scale to merge the two predicted scores, leading to better identity preservation and more effective editing. This approach allows our method to overcome the limitations faced by the alternative solutions, producing high-quality 3D avatars with improved fine-grained editing results.

Limitations and failure cases. While setting a new state of the art, we acknowledge that HeadSculpt has limitations, as the failure cases in Fig.
9 demonstrate: (1) non-deformable results hinder further extensions and applications in audio or video-driven problems; (2) generated textures are highly saturated and less realistic, especially for characters with highly detailed appearances (e.g., Freddy Krueger); (3) some inherited biases from Stable Diffusion [69] still remain, such as inaccurate and stereotypical appearances of Asian characters (e.g., Sun Wukong and Japanese Geisha); and (4) limitations inherited from Instruct Pix2Pix [5], such as the inability to perform large spatial manipulations (e.g., remove his nose). 5 Conclusions We have introduced Head Sculpt, a novel pipeline for generating high-resolution 3D human avatars and performing identity-aware editing tasks through text. We proposed to utilize a prior-driven score distillation that combines a landmark-based Control Net and view-dependent textual inversion to address the Janus problem. We also introduced identity-aware editing score distillation that preserves both the original identity information and the editing instruction. Extensive evaluations demonstrated that our Head Sculpt produces high-fidelity results under various scenarios, outperforming state-ofthe-art methods significantly. Societal impact. The advancements in geometry and texture generation for human head avatars could be deployed in many AR/VR use cases but also raises concerns about their potential malicious use. We encourage responsible research and application, fostering open and transparent practices. Acknowledgment. This work is partially supported by Hong Kong Research Grant Council - Early Career Scheme (Grant No. 27208022) and HKU Seed Fund for Basic Research. We also thank the anonymous reviewers for their constructive suggestions. [1] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. ar Xiv preprint ar Xiv:2211.01324, 2022. 3 [2] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022. 3 [3] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023. 3 [4] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV, 2016. 3 [5] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023. 3, 4, 6, 7, 9, 10, 17 [6] Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong. Dreamavatar: Text-and-shape guided 3d human avatar generation via diffusion models. ar Xiv preprint ar Xiv:2304.00916, 2023. 1, 3, 5, 7 [7] Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong. Guide3d: Create 3d avatars from text and image guidance. ar Xiv preprint ar Xiv:2308.09705, 2023. 3 [8] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In CVPR, 2022. 1, 3 [9] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. 
Text2tex: Text-driven texture synthesis via diffusion models. In ICCV, 2023. 1 [10] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In ICCV, 2023. 2, 3, 7, 22 [11] Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, and Kui Jia. Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition. In Neur IPS, 2023. 3 [12] Crucible AI. Control Net Media Pipe Face. https://huggingface.co/Crucible AI/ Control Net Media Pipe Face, 2023. 5, 6 [13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Neur IPS, 2020. 3 [14] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. ar Xiv preprint ar Xiv:2302.03011, 2023. 3 [15] Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (TOG), 2021. 1, 3 [16] Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In CVPR, 2021. 3 [17] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023. 3, 4, 6 [18] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegannada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 2022. 9 [19] Jun Gao, Wenzheng Chen, Tommy Xiang, Alec Jacobson, Morgan Mc Guire, and Sanja Fidler. Learning deformable tetrahedral meshes for 3d reconstruction. In Neur IPS, 2020. 5 [20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 2020. 3 [21] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. Instructnerf2nerf: Editing 3d scenes with instructions. In ICCV, 2023. 6 [22] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. ar Xiv preprint ar Xiv:2304.07090, 2023. 3 [23] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-toprompt image editing with cross attention control. ar Xiv preprint ar Xiv:2208.01626, 2022. 3 [24] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2021. 9 [25] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Neur IPS, 2020. 3 [26] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In Neur IPS Workshop, 2021. 6 [27] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: zero-shot text-driven generation and animation of 3d avatars. ACM Transactions on Graphics (TOG), 2022. 1, 3 [28] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In CVPR, 2022. 1, 3 [29] Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In CVPR, 2022. 3 [30] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 
3 [31] Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, and Matthias Grundmann. Real-time facial surface geometry from monocular video on mobile gpus. In CVPR workshops, 2019. 3, 5 [32] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023. 3 [33] Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. In ECCV, 2022. 1 [34] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Popa Tiberiu. Clip-mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia, 2022. 1, 3 [35] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 6 [36] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics (TOG), 2020. 5 [37] Chenghao Li, Chaoning Zhang, Atish Waghwase, Lik-Hang Lee, Francois Rameau, Yang Yang, Sung-Ho Bae, and Choong Seon Hong. Generative ai meets 3d: A survey on text-to-3d in aigc era. ar Xiv preprint ar Xiv:2305.06131, 2023. 3 [38] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Transactions on Graphics (TOG), 2017. 1, 3, 5 [39] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023. 2, 3, 5, 6, 7, 17 [40] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023. 2 [41] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris Mc Clanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Yong, Juhyun Lee, et al. Mediapipe: A framework for perceiving and processing reality. In CVPR workshops, 2019. 3, 5 [42] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360deg reconstruction of any object from a single image. In CVPR, 2023. 2 [43] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shapeguided generation of 3d shapes and textures. In CVPR, 2023. 1, 3, 7, 8, 9, 19, 20, 21, 22 [44] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In CVPR, 2022. 1, 3 [45] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 1, 3 [46] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ar Xiv preprint ar Xiv:1411.1784, 2014. 3 [47] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2iadapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. ar Xiv preprint ar Xiv:2302.08453, 2023. 3 [48] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 2022. 3, 4, 5 [49] Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas Müller, and Sanja Fidler. Extracting triangular 3d models, materials, and lighting from images. In CVPR, 2022. 
3, 5 [50] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. ar Xiv preprint ar Xiv:2212.08751, 2022. 1 [51] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. Stylesdf: High-resolution 3d-consistent image and geometry generation. In CVPR, 2022. 3 [52] Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. Benchmark for compositional text-to-image synthesis. In Neur IPS Datasets and Benchmarks Track (Round 1), 2021. 9 [53] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Neur IPS, 2019. 6 [54] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2022. 1, 2, 3, 4, 5, 6, 7, 9, 17 [55] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 1, 3 [56] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. ar Xiv preprint ar Xiv:2303.13508, 2023. 2 [57] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. 3 [58] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. ar Xiv preprint ar Xiv:2302.01721, 2023. 1 [59] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 1, 3, 4 [60] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023. 3 [61] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-toimage diffusion models with deep language understanding. ar Xiv preprint ar Xiv:2205.11487, 2022. 1, 3, 4 [62] Aditya Sanghi, Hang Chu, Joseph G Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. Clip-forge: Towards zero-shot text-to-shape generation. In CVPR, 2022. 1, 3 [63] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In Neur IPS, 2022. 3 [64] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. ar Xiv preprint ar Xiv:2111.02114, 2021. 3 [65] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. ar Xiv preprint ar Xiv:2303.07937, 2023. 
2, 3, 5, 7, 8, 9, 19, 20 [66] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. In Neur IPS, 2021. 1, 3, 4, 5 [67] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023. 3 [68] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021. 3 [69] Stability.AI. Stable diffusion. https://stability.ai/blog/stable-diffusion-public-release, 2022. 1, 3, 4, 6, 10, 17 [70] Stability.AI. Stability AI releases Deep Floyd IF, a powerful text-to-image model that can smartly integrate text into images. https://stability.ai/blog/deepfloyd-if-text-to-image-model, 2023. 3 [71] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. ACM Transactions on Graphics (TOG), 2022. 3 [72] Jingxiang Sun, Xuan Wang, Lizhen Wang, Xiaoyu Li, Yong Zhang, Hongwen Zhang, and Yebin Liu. Next3d: Generative neural texture rasterization for 3d-aware head avatars. In CVPR, 2023. 1, 3 [73] Jiaxiang Tang. Stable-dreamfusion: Text-to-3d with stable-diffusion. https://github.com/ashawkey/ stable-dreamfusion, 2022. 6, 7, 8, 9, 16, 19, 20, 21 [74] Jiaxiang Tang. Fantasia3d.unofficial. https://github.com/ashawkey/fantasia3d.unofficial, 2023. 8, 9, 16, 19, 20, 21, 22 [75] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In ICCV, 2023. 2 [76] Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. Textmesh: Generation of realistic 3d meshes from text prompts. ar Xiv preprint ar Xiv:2304.12439, 2023. 2 [77] Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. Unitune: Text-driven image editing by fine tuning an image generation model on a single image. ar Xiv preprint ar Xiv:2210.09477, 2022. 3 [78] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/ huggingface/diffusers, 2022. 6 [79] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023. 2, 7 [80] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR, 2023. 3 [81] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023. 3 [82] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. In CVPR, 2023. 3 [83] Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. Dreamface: Progressive generation of animatable 3d faces under text guidance. ar Xiv preprint ar Xiv:2304.03117, 2023. 3, 7, 8 [84] Lvmin Zhang and Maneesh Agrawala. 
Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023. 3, 4, 5, 6, 10 [85] Rui Zhao, Wei Li, Zhipeng Hu, Lincheng Li, Zhengxia Zou, Zhenwei Shi, and Changjie Fan. Zero-shot text-to-parameter translation for game character auto-creation. In CVPR, 2023. 3 [86] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In CVPR, 2022. 5 [87] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C. Bühler, Xu Chen, Michael J. Black, and Otmar Hilliges. I M Avatar: Implicit morphable head avatars from videos. In CVPR, 2022. 1, 3 [88] Wojciech Zielonka, Timo Bolkart, and Justus Thies. Instant volumetric head avatars. In CVPR, 2023. 3

Table 2: Hyper-parameters of HeadSculpt.

| Group | Hyper-parameter | Value |
|---|---|---|
| Camera setting | θ range | (20, 110) |
| | Radius range | (1.0, 1.5) |
| | FoV range | (30, 50) |
| Render setting | Resolution for coarse | (64, 64) |
| | Resolution for fine | (512, 512) |
| | Max num steps sampled per ray | 1024 |
| | Iter interval to update extra status | 16 |
| Diffusion setting | Guidance scale | 100 |
| | t range | (0.02, 0.98) |
| | ω(t) | α_t(1 − α_t) |
| Training setting | #Iterations for coarse | 70k |
| | #Iterations for fine | 50k |
| | Batch size | 4 |
| | LR of grid encoder | 1e-3 |
| | LR of NeRF MLP | 1e-3 |
| | LR of s_i and ∆v_i in DMTET | 1e-2 |
| | LR scheduler | constant |
| | Warmup iterations | 20k |
| | Optimizer | Adam (0.9, 0.99) |
| | Weight decay | 0 |
| | Precision | fp16 |
| Hardware | GPU | 1× Tesla V100 (32GB) |
| | Training duration | 1h (coarse) + 1h (fine) |

A Implementation details

A.1 Details about 3D scene models

In the coarse stage, we make use of the grid frequency encoder γ(·) from the publicly available Stable-DreamFusion [73]. This encoder maps the input x ∈ R³ to a higher-frequency dimension, yielding γ(x) ∈ R³². The MLP within our NeRF model consists of three layers with dimensions [32, 64, 64, 3+1+3]. Here, the output channels "3", "1", and "3" represent the predicted normals, density value, and RGB colors, respectively. In the fine stage, we directly optimize the signed distance value s_i ∈ R, along with a position offset ∆v_i ∈ R³, for each vertex v_i. We found that fitting s_i and ∆v_i into an MLP, as done by Fantasia3D [74], often leads to diverged training. To ensure easy reproducibility, we have included all the hyper-parameters used in our experiments in Tab. 2. The other hyper-parameters are set to the defaults of Stable-DreamFusion [73].

A.2 Details about textual inversion

In the main paper, we discussed the collection of a tiny dataset consisting of 34 images depicting the back view of heads. This dataset was used to train a special token to address the ambiguity associated with the back view of landmarks. The images in the dataset were selected to encompass a diverse range of gender, color, age, and other characteristics. A few samples from the dataset are shown in Fig. 11. While our simple selection strategy has proven effective in our specific case, we believe that a more refined collection process could further enhance the controllability of the learned token. We use the default training recipe provided by Hugging Face Diffusers (https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion), which took us 1 hour on a single Tesla V100 GPU.

Figure 11: Samples of the tiny dataset collected for learning the token.

B Further analysis

B.1 Effectiveness of textual inversion on 2D generation

To show the effectiveness of the learned token, we conduct an analysis of its control capabilities in the context of 2D generation results.
Specifically, we compare two generation results using Stable Diffusion [69], with both experiments sharing the same random seed. One experiment has the text prompt appended with the plain phrase "back view", while the other utilizes the learned special token in the prompt. We present a selection of randomly generated results in Fig. 12. The observations indicate that the token effectively steers the pose of the generated heads towards the back, resulting in a distinct appearance. Remarkably, the token demonstrates a notable generalization ability, as evidenced by the Batman case, despite not having been trained specifically on back views of Batman in the textual inversion process.

B.2 Inherent bias in 2D diffusion models

In our main paper, we discussed the motivation behind our proposed identity-aware editing score distillation (IESD), which can be attributed to two key factors. Firstly, the limitations of prompt-based editing [54, 39] are due to the inherent bias present in Stable Diffusion (SD). Secondly, while InstructPix2Pix (IP2P) [5] offers a solution by employing instruction-based editing to mitigate this bias, it often results in identity loss. To further illustrate this phenomenon, we showcase the biased 2D outputs of SD and ControlNet-based IP2P in Fig. 13. Modified descriptions and instructions are utilized in these respective methods to facilitate the editing process and achieve the desired results. The results provide clear evidence of the following: (1) SD generates biased outcomes, with a tendency to underweight the "older" aspect and overweight the "skull" aspect in the modified description; (2) IP2P demonstrates the ability to edit the image successfully, but it faces challenges in preserving the identity of the avatar. The aforementioned inherent biases are amplified in the domain of 3D generation (refer to Fig. 10 in the main paper) due to the optimization process guided by the SDS loss, which tends to prioritize view consistency at the expense of sacrificing prominent features. To address this issue, our proposed IESD approach combines two types of scores: one for editing and the other for identity preservation. This allows us to strike a balance between preserving the initial appearance and achieving the desired editing outcome.

Figure 12: Analysis of the learned token on 2D image generation, using the prompts "a DSLR portrait of Obama", "a DSLR portrait of Hillary Clinton", "a DSLR portrait of a boy with facial painting", and "a DSLR portrait of Batman" with several random seeds each. For each pair of images, we present two 2D images generated with the same random seed, where the left image is conditioned on the plain text "back view" and the right image is conditioned on the learned token.

Figure 13: Analysis of the inherent bias in 2D diffusion models, on two cases: modified description "a DSLR portrait of +[older] Saul Goodman" / instruction "make him older", and modified description "a DSLR portrait skull of Vincent van Gogh" / instruction "turn his face into a skull". For each case, we display several 2D outputs of SD and IP2P, utilizing modified descriptions and instructions, respectively, with reference images from our coarse-stage NeRF model to facilitate the editing process.
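To make the score blending behind IESD concrete, below is a minimal, hedged sketch mirroring Eq. (7) of the main paper: the noise predicted under the editing instruction ŷ and the noise predicted under the original description y are mixed with the edit scale ω_e before forming the usual SDS-style gradient. `eps_model`, `instr_emb`, `desc_emb`, `landmark_map`, and `ref_image` are assumed stand-ins for the ControlNet-conditioned denoiser and its conditions C(P_π) and I(M_π); this is not the released implementation.

```python
import torch

def iesd_noise(eps_model, z_t, t, desc_emb, instr_emb, landmark_map, ref_image, edit_scale=0.6):
    """Blend the editing score and the identity-preserving score (Eq. (7)):
    eps_hat = w_e * eps(z_t; y_hat) + (1 - w_e) * eps(z_t; y),
    with both branches conditioned on the landmark ControlNet C(P_pi) and the
    ControlNet-based InstructPix2Pix branch I(M_pi)."""
    with torch.no_grad():
        eps_edit = eps_model(z_t, t, instr_emb, landmark_map, ref_image)  # follows the instruction y_hat
        eps_keep = eps_model(z_t, t, desc_emb, landmark_map, ref_image)   # anchors to the original description y
    return edit_scale * eps_edit + (1.0 - edit_scale) * eps_keep

# The blended prediction replaces eps_pred in the SDS-style gradient:
#   grad = w(t) * (iesd_noise(...) - eps)
# A larger edit_scale (e.g., 1.0) follows the instruction more aggressively, while a
# smaller one (e.g., 0.2) stays closer to the reference identity (cf. Fig. 5; default 0.6).
```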
Dream Fusion* [73] Latent-Ne RF [43] 3DFuse [65] Fantasia3D* [74] Head Sculpt (Ours) a DSLR portrait of Batman a DSLR portrait of Black Panther in Marvel a DSLR portrait of Two-face in DC a DSLR portrait of Doctor Strange a head of Terracotta Army Figure 14: Additional comparisons with existing methods on generation (Part 1). *Non-official. C Additional qualitative comparisons C.1 Comparison with existing methods on generation results We provide more qualitative comparisons with four baseline methods [73, 43, 65, 74] in Fig. 14 and Fig. 15. These results serve to reinforce the claims made in Sec. 4.1 of the main paper, providing further evidence of the superior performance of our Head Sculpt in generating high-fidelity head avatars. These results showcase the ability of our method to capture intricate details, realistic textures, and overall visual quality, solidifying its position as a state-of-the-art solution in this task. Notably, to provide a more immersive and comprehensive understanding of our results, we include multiple outcomes of our Head Sculpt in the form of 360 rotating videos. These videos can be accessed on https://brandonhan.uk/Head Sculpt, enabling viewers to observe the generated avatars from various angles and perspectives. Dream Fusion* [73] Latent-Ne RF [43] 3DFuse [65] Fantasia3D* [74] Head Sculpt (Ours) a head of Simpson in the Simpsons a head of Naruto Uzumaki a DSLR portrait of Napoleon Bonaparte a DSLR portrait of Leo Tolstoy a DSLR portrait of Audrey Hepburn a DSLR portrait of Obama with a baseball cap a DSLR portrait of Taylor Swift Figure 15: Additional comparisons with existing methods on generation (Part 2). *Non-official. Dream Fusion* [73] Latent-Ne RF [43] Fantasia3D* [74] Head Sculpt (Ours) a DSLR portrait of Saul Goodman a DSLR portrait of +[older] Saul Goodman make him older a DSLR portrait of Vincent van Gogh a DSLR portrait skull of Vincent van Gogh turn his face into a skull Figure 16: Comparisons with existing methods on editing.*Non-official. C.2 Comparison with existing methods on editing results Since the absence of alternative methods specifically designed for editing, we conduct additional evaluations of the editing results generated by existing methods by modifying the text prompts. Fig. 16 illustrates that bias in editing is a pervasive issue encountered by all the baselines. This bias stems from the shared SDS guidance function, which is based on a diffusion prior, despite the variations in representation and optimization methods employed by these approaches. Instead, IESD enables the guidance function to incorporate information from two complementary sources: (1) the original image gradient, which preserves identity, and (2) the editing gradient, which captures desired modifications. By considering both terms, our approach grants more explicit and direct control over the editing process compared to the conventional guidance derived solely from the input. Latent-Ne RF [43] Fantasia3D* [74] Head Sculpt (Ours) a head of ant-man in Marvel Figure 17: Results across random seeds (0, 1, 2).*Non-official. C.3 Comparison with existing methods on stability We observe that all baselines tend to have diverged training processes as they do not integrate 3D prior to the diffusion model. Taking two shape-guided prior methods (i.e., Latent-Ne RF [43] and Fantasia3D [10]) as examples, we compare their generation results and ours across different random seeds. We conduct comparisons under the same default hyper-parameters and present the results in Fig. 17. 
We observe that prior methods need several runs to obtain their best generation, whereas ours achieves consistent results across different runs. Our method thus features stable training, without the need for cherry-picking over many runs.