# HumanTOMATO: Text-aligned Whole-body Motion Generation

Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, Heung-Yeung Shum

Co-first author. Listing order is random. {shunlinlu0803, thu.lhchen}@gmail.com

Tsinghua University; International Digital Economy Academy (IDEA); School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-SZ)

This work was done when S. Lu, L.H. Chen, and J. Lin were research interns at IDEA. Corresponding authors: R. Zhang and H.-Y. Shum. Project lead: A. Zeng. Project page: https://lhchen.top/HumanTOMATO

Figure 1: The proposed HumanTOMATO can generate text-aligned whole-body motions with vivid and harmonious face, hand, and body motion. We show two generated motion keyframes based on the given texts: (a) "stand and shush, angrily."; (b) "Yang-style 40 form Tai Chi Competition routine step 34, happily."

This work targets a novel text-driven whole-body motion generation task, which takes a given textual description as input and aims at generating high-quality and diverse facial expressions, hand gestures, and body motions simultaneously. Previous works on text-driven motion generation mainly have two limitations: they ignore the key role of fine-grained hand and face control in vivid whole-body motion generation, and they lack a good alignment between text and motion. To address these limitations, we propose a Text-aligned whOle-body Motion generATiOn framework, named HumanTOMATO, which is, to our knowledge, the first attempt towards applicable holistic motion generation in this research area. To tackle this challenging task, our solution includes two key designs: (1) a Holistic Hierarchical VQ-VAE (aka H2VQ) and a Hierarchical-GPT for fine-grained body and hand motion reconstruction and generation with two structured codebooks; and (2) a pre-trained text-motion-alignment model to help the generated motion align with the input textual description explicitly. Comprehensive experiments verify that our model has significant advantages in both the quality of generated motions and their alignment with text.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

Recent years have seen an explosion of demand for generating high-quality 3D human motions in many scenarios, such as games, films, animation, and robotics. To reduce laborious effort in animation creation, recent studies (Tevet et al., 2023; Chen et al., 2023b; Zhang et al., 2022; 2023) attempt to generate human motions from textual descriptions in a natural, interactive way and have achieved rapid progress. However, the motions generated by existing works are still unsatisfactory for real application needs. The problem is mainly due to two aspects. First, existing text-driven motion generation models can only generate body-only motions rather than whole-body motions, which are highly expressive yet much more challenging. On the one hand, this challenge comes from the limited availability of whole-body motion data. On the other hand, whole-body motion is much more complex, since fine-grained motions of the body, hands, and face must all be well generated. How to model whole-body human motions is still under-explored. Second, the generated motions lack semantic alignment with the textual description.
Existing methods adopt CLIP (Radford et al., 2021) or Large Language Models (LLMs) (Raffel et al., 2020) to provide language guidance for motion generation (Zhang et al., 2022; Tevet et al., 2023; 2022; Jiang et al., 2023). However, their alignment supervision is provided at the frame level and lacks a sufficient understanding of a motion at the whole-sequence level. As a result, they often fail to distinguish scenarios such as "walking in a clockwise circle" versus "walking in a counter-clockwise circle", which requires understanding motions at the sequence level rather than the frame level. This drawback severely limits the ability to generate motions well-aligned with textual descriptions.

To tackle the above issues, we propose a novel Text-aligned whOle-body Motion generATiOn framework (HumanTOMATO), which includes two key designs. First, a holistic hierarchical discrete modeling strategy for body and hand motions is proposed for reconstructing and generating whole-body motions vividly. As whole-body motion is a high-dimensional spatio-temporal signal, in the first stage we propose a Holistic Hierarchical VQ-VAE (aka H2VQ) to compress the motion into two-level discrete codes for body and hand, respectively. In contrast, a naïve solution that simply replaces body-only motions with whole-body motions, or directly increases the size of the codebook, yields little benefit. The key insight of our H2VQ is learning informative and compact representations of fine-grained whole-body motions at very low bit rates. Moreover, hand and body motions have different levels of amplitude and detail, which motivates us to model them separately. Based on the two-level discrete codes, in the second stage we propose a Hierarchical-GPT to predict the hierarchical discrete codes of body and hand in an auto-regressive fashion. Extending the hierarchical modeling strategy of body-hand motions, we use an RVQ-based method for facial motion reconstruction and likewise generate facial motion with discrete codes auto-regressively.

Second, a pre-trained text-motion-alignment model is introduced to enhance the textual alignment of generated motions for the first time. We pre-train a motion encoder and a text encoder, namely TMA (Text-Motion Alignment), in a contrastive learning manner (Radford et al., 2021) with pairwise text-motion data. Unlike previous works (Zhang et al., 2022; Tevet et al., 2023; 2022; Jiang et al., 2023) that rely on CLIP or LLM embeddings, our approach utilizes the TMA text embedding as a language prior. In this way, TMA provides a motion-aware language embedding for the Hierarchical-GPT to generate discrete motion codes more precisely. It is worth noting that, during training, merely supervising the prediction of discrete code tokens of body and hand is insufficient, as it lacks supervision on the semantics of the global motion sequence and leads to error accumulation in auto-regressive prediction. Thus, with the text-motion similarity measured by TMA, we additionally provide text-motion alignment supervision to explicitly supervise the alignment between generated motion sequences and texts.

With these key designs, compared with previous text-driven motion generation works, HumanTOMATO can generate whole-body motions semantically aligned with texts, as illustrated in Figure 1. To evaluate the alignment between generated motions and input texts, we further revisit the retriever previously used for evaluating text-motion alignment and find that its retrieval ability is worse than that of TMA.
Hence, we introduce two new criteria (TMA-R-Precision(256) and TMA-Matching-score), which are more accurate and challenging for evaluating text-motion alignment in this task.

We summarize our key contributions as follows:

- To the best of our knowledge, we propose the challenging Text-driven whOle-body Motion generATiOn task for the first time and design a model (HumanTOMATO) to generate vivid whole-body motions aligned with texts.
- To tackle the challenging whole-body motion generation problem, we introduce H2VQ for fine-grained body and hand motion reconstruction. Accordingly, we develop a Hierarchical-GPT combined with a facial motion generator to generate whole-body motions.
- To enhance the consistency and alignment between texts and motions, we pre-train text-motion-aligned encoders via a contrastive objective and introduce sequence-level semantic supervision to help motion-text alignment.
- We propose two new criteria (TMA-R-Precision(256) and TMA-Matching-score), which are more accurate and challenging for evaluating text-motion alignment.

We evaluate HumanTOMATO on both whole-body (Lin et al., 2023b) and body-only (Guo et al., 2022) motion generation benchmarks and answer four research questions based on our contributions. Comprehensive experiments affirm the vividness and alignment of our generated motions, outperforming competitors on motion reconstruction (32.7% improvement in MPJPE vs. VQ) and motion generation metrics (9.2% improvement in TMA-R-Precision(256) Top3 vs. the best baseline).

## 2. Related Work

Due to the page limitation, we leave more discussion on related work to Appendix A. There are three related research directions, including unconditional motion generation (Yan et al., 2019; Zhao et al., 2020; Zhang et al., 2020; Cai et al., 2021), text-driven motion generation (Petrovich et al., 2022; Zhang et al., 2022; Chen et al., 2023b; Guo et al., 2022), and co-speech motion generation (Yi et al., 2023; Zhi et al., 2023; Fan et al., 2022; Habibie et al., 2021). However, these works cannot generate whole-body motion from text. Besides, effective quantization methods and how to achieve higher text-motion alignment have not been carefully explored yet. Accordingly, we introduce our methodology as follows.

## 3. Methodology

### 3.1. Problem Formulation

We clarify notations and set up the novel research problem of text-driven whole-body motion generation. Given a text description $t$ of a human motion, such as "The man is playing the ukulele happily.", the model should generate a vivid whole-body motion $m = [m_1, m_2, \cdots, m_L] \in \mathbb{R}^{L \times d}$ aligned with the text description, where $L$ and $d$ denote the number of frames and the dimension of the motion in each frame, respectively. As whole-body motion consists of hand, body, and face motions, we can also decompose $m$ as $\{m^H, m^B, m^F\}$, where $m^H \in \mathbb{R}^{L \times d_h}$, $m^B \in \mathbb{R}^{L \times d_b}$, $m^F \in \mathbb{R}^{L \times d_f}$, and $d = d_h + d_b + d_f$. Mathematically, we formulate text-driven whole-body motion generation as follows:

$$\Theta = \arg\max_{\Theta} P_{\Theta}(m \mid t), \quad (1)$$

where $\Theta$ denotes the model parameters and $P_{\Theta}(\cdot)$ denotes the motion distribution.

### 3.2. Learning Discrete Whole-body Representations

Vanilla Motion VQ-VAE. Motion VQ-VAE aims to learn discrete representations of human motions in an encoding-decoding fashion. Specifically, VQ-VAE recovers motions via an auto-encoder and learns a codebook $\mathcal{C} = \{e_k\}_{k=1}^{K}$, where $K$ denotes the codebook size and $e_k$ indicates the $k$-th embedded representation in the codebook.
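As a concrete reference for the quantization step formalized next (Eq. 2), here is a minimal PyTorch-style sketch of the nearest-neighbor codebook lookup together with the standard straight-through gradient trick used in VQ-VAEs; the tensor shapes and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch

def quantize(z, codebook):
    """Nearest-neighbor codebook lookup (cf. Eq. 2), sketch only.

    z:        (N, D) latent vectors from the motion encoder
    codebook: (K, D) embeddings e_1, ..., e_K
    """
    # Squared Euclidean distance from every latent to every codebook entry.
    dist = torch.cdist(z, codebook) ** 2          # (N, K)
    indices = dist.argmin(dim=-1)                 # (N,)
    z_hat = codebook[indices]                     # (N, D)
    # Straight-through estimator: forward pass uses z_hat, gradients flow to z.
    z_st = z + (z_hat - z).detach()
    return z_st, indices
```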
Given a vector $z$ and the quantizer $Q(\cdot; \mathcal{C})$, the quantized vector is the element of the codebook $\mathcal{C}$ that minimizes the reconstruction error of $z$:

$$\hat{z} = Q(z; \mathcal{C}) = \arg\min_{e_k \in \mathcal{C}} \| z - e_k \|_2^2. \quad (2)$$

In a vanilla VQ-VAE, $z = \mathrm{Enc}(m)$ is the latent code extracted by a motion encoder $\mathrm{Enc}(\cdot)$. The VQ-VAE can thus be optimized by

$$\mathcal{L} = \| m - \mathrm{Dec}(Q(z; \mathcal{C})) \|_2^2 + \alpha \| z - \mathrm{sg}(\hat{z}) \|_2^2, \quad (3)$$

where $\alpha$ is a hyper-parameter, $\mathrm{sg}(\cdot)$ is the stop-gradient operation, and $\mathrm{Dec}(\cdot)$ denotes the motion decoder. Different from traditional methods, the codebook $\mathcal{C}$ in motion VQ-VAE is optimized by exponential moving average (EMA) and codebook reset (Code Reset) operations, following Razavi et al. (2019); Van Den Oord et al. (2017); Zhang et al. (2023). While the discrete vector quantization of a vanilla VQ-VAE is capable of compressing human motions, it falls short in minimizing quantization errors for detailed whole-body motion generation. An intuitive remedy is to increase the size of the codebook; however, this scheme introduces additional computational cost and quickly encounters performance bottlenecks (see results in Section 4.4).

Holistic Hierarchical VQ-VAE. Recently, the Residual Vector Quantization technique, also known as RVQ (Barnes et al., 1996; Zeghidour et al., 2021; Yao et al., 2023), has significantly advanced music generation (Défossez et al., 2022; Copet et al., 2023). Technically, RVQ iteratively quantizes, at each level, the quantization error left by the previous level, reducing quantization errors effectively while maintaining a low memory cost for the codebook (see Appendix C.2 for details). Motivated by this (Défossez et al., 2022), we propose a novel Holistic Hierarchical Vector Quantization scheme, H2VQ for short, for motion generation. Unlike RVQ, we incorporate a kinematic structure prior into the H2VQ modeling, enabling it to learn compact representations of fine-grained whole-body motions at an extremely low bit rate. Given the distinct differences in amplitude and frequency between body and hand motions, we further design two separate encoders and codebooks to learn discrete representations for body and hand motions.

The architecture of our proposed H2VQ is illustrated in Figure 2(a). In the encoding phase, we input hand and body motions and obtain hand and body tokens through the hand encoder $\mathrm{Enc}^H(\cdot)$ and body encoder $\mathrm{Enc}^B(\cdot)$, respectively. The learned hand tokens are further quantized by the Hand Quantizer $Q^H(\cdot; \mathcal{C}^H)$ as $z^H$. Since body motions are usually highly associated with hand gestures (Ao et al., 2022), to train a more natural and coordinated body codebook, we fuse the body and hand tokens using Concat($\cdot$) and Conv1d($\cdot$) operations. As shown in Figure 2, before this fusion, the quantized hand tokens undergo a transformation through a projection layer. After that, the fused tokens are further quantized by the Body Quantizer $Q^B(\cdot; \mathcal{C}^B)$ as $z^B$. Finally, the hand tokens $z^H$ and body tokens $z^B$ are concatenated together and fed into the Body-hand Decoder to reconstruct the body-hand motions precisely. During the training phase, the primary goal is to reconstruct motions while concurrently updating the two codebooks through the EMA and Code Reset operations (Razavi et al., 2019; Van Den Oord et al., 2017; Zhang et al., 2023).
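To make the body-hand encoding path above concrete, the following is a minimal, self-contained sketch of an H2VQ-style forward pass as we read Figure 2(a). The module layout (Conv1d encoders, 1x1-conv projection and fusion, a shared nearest-neighbor lookup) and all dimensions are assumptions for illustration rather than the paper's implementation; temporal down-sampling, the EMA codebook update, and Code Reset are omitted.

```python
import torch
import torch.nn as nn

def nearest_code(x, codebook):
    # x: (B, D, T); quantize each time step to its nearest codebook entry,
    # with a straight-through estimator for gradients.
    B, D, T = x.shape
    flat = x.permute(0, 2, 1).reshape(-1, D)               # (B*T, D)
    idx = torch.cdist(flat, codebook).argmin(dim=-1)       # (B*T,)
    q = codebook[idx].reshape(B, T, D).permute(0, 2, 1)    # (B, D, T)
    return x + (q - x).detach()

class H2VQSketch(nn.Module):
    """Illustrative sketch of the H2VQ body-hand tokenization path (cf. Figure 2(a))."""
    def __init__(self, d_hand, d_body, d_tok, k_hand, k_body):
        super().__init__()
        self.enc_hand = nn.Conv1d(d_hand, d_tok, 3, padding=1)          # hand encoder
        self.enc_body = nn.Conv1d(d_body, d_tok, 3, padding=1)          # body encoder
        self.cb_hand = nn.Embedding(k_hand, d_tok)                      # hand codebook C^H
        self.cb_body = nn.Embedding(k_body, d_tok)                      # body codebook C^B
        self.proj = nn.Conv1d(d_tok, d_tok, 1)                          # transform quantized hand tokens
        self.fuse = nn.Conv1d(2 * d_tok, d_tok, 1)                      # Concat + Conv1d fusion
        self.dec = nn.Conv1d(2 * d_tok, d_hand + d_body, 3, padding=1)  # body-hand decoder

    def forward(self, m_hand, m_body):
        # m_hand: (B, d_hand, T), m_body: (B, d_body, T)
        z_h = nearest_code(self.enc_hand(m_hand), self.cb_hand.weight)  # quantized hand tokens
        b = self.enc_body(m_body)                                       # body tokens
        fused = self.fuse(torch.cat([b, self.proj(z_h)], dim=1))        # inject hand info into body stream
        z_b = nearest_code(fused, self.cb_body.weight)                  # quantized body tokens
        return self.dec(torch.cat([z_h, z_b], dim=1))                   # reconstructed body-hand motion

# Usage sketch: reconstruct a batch of 2 clips, 64 frames each (toy dimensions).
model = H2VQSketch(d_hand=90, d_body=66, d_tok=128, k_hand=512, k_body=512)
recon = model(torch.randn(2, 90, 64), torch.randn(2, 66, 64))           # (2, 156, 64)
```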
In the inference phase, after obtaining the quantized code indices, the Body-hand Decoder generates body-hand motions by querying the respective codebooks with the obtained indices. The detailed algorithmic flows for both training and inference phases can be found in Appendix C.

Figure 2: The framework overview of the tokenization methods for body-hand and facial motions. (a) Holistic Hierarchical Vector Quantization (H2VQ) compresses fine-grained body-hand motion into two discrete codebooks with hierarchical structure relations. (b) Residual Vector Quantization (RVQ) compresses facial motion into two discrete codebooks with hierarchical structure relations.

### 3.3. Hierarchical Whole-body Motion Generation

Given the two precise quantized codebooks of H2VQ, the motion sequence can be generated with the corresponding decoders and quantized codes. The previously popular approach is to predict code indices in a GPT-like auto-regressive fashion (Zhang et al., 2023). Since the proposed H2VQ requires two codebooks with structural relations, that approach is not directly applicable. To better model the natural coherence of body-hand motions, we design a hierarchical discrete-code prediction module, named Hierarchical-GPT and illustrated in Figure 3(a), for generating body-hand motions.

Hierarchical-GPT. The Hierarchical-GPT is built upon a transformer-based architecture, where the first input token is a textual embedding. With the input body-hand motions $m^B = [m^B_1, m^B_2, \cdots, m^B_L]$ and $m^H = [m^H_1, m^H_2, \cdots, m^H_L]$, we have corresponding code indices, denoted as $I^B = [I^B_1, I^B_2, \cdots, I^B_{L/r}, \mathrm{End}]$ and $I^H = [I^H_1, I^H_2, \cdots, I^H_{L/r}, \mathrm{End}]$, where $\mathrm{End}$ indicates the end token and $r$ denotes the down-sampling rate used to convert the input motion sequence into discrete motion tokens. Therefore, as shown in Figure 3(a), the code-index prediction can be formulated as an auto-regressive prediction problem:

$$P(I^{B,H}_{1,2,\cdots,L/r} \mid t) = \prod_{s=1}^{L/r} P(I^{B,H}_s \mid I^{B,H}_{<s}, t).$$
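As a rough illustration of this factorization, the sketch below samples body and hand code indices step by step, conditioned on a text embedding. The `model` interface (returning next-step logits for both codebooks) and the End-token handling are hypothetical stand-ins; the paper's exact Hierarchical-GPT architecture and token packing are not reproduced here.

```python
import torch

@torch.no_grad()
def generate_codes(model, text_emb, max_steps, end_id):
    """Sample body-hand code indices auto-regressively, conditioned on text.

    Assumes `model(text_emb, body_ids, hand_ids)` returns a pair of 1-D logit
    vectors over the body and hand codebooks (plus the End token) for the next
    step; this mirrors P(I_s | I_<s, t) but is not the paper's exact model.
    """
    body_ids, hand_ids = [], []
    for _ in range(max_steps):
        body_logits, hand_logits = model(text_emb, body_ids, hand_ids)
        next_body = torch.distributions.Categorical(logits=body_logits).sample().item()
        next_hand = torch.distributions.Categorical(logits=hand_logits).sample().item()
        if next_body == end_id or next_hand == end_id:   # stop once an End token appears
            break
        body_ids.append(next_body)
        hand_ids.append(next_hand)
    # The indices would then be de-quantized via the H2VQ codebooks and decoded
    # by the Body-hand Decoder to produce the final motion.
    return body_ids, hand_ids
```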