# ReMoGPT: Part-Level Retrieval-Augmented Motion-Language Models

Qing Yu, Mikihiro Tanaka, Kent Fujiwara
LY Corporation
{yu.qing, mikihiro.tanaka, kent.fujiwara}@lycorp.co.jp

## Abstract

Generation of 3D human motion holds significant importance in the creative industry. While notable advances have recently been made in generating common motions, existing methods struggle to generate diverse and rare motions due to the complexity of motion and limited training data. This work introduces ReMoGPT, a unified motion-language generative model that solves a wide range of motion-related tasks by incorporating a multi-modal retrieval mechanism into the generation process, addressing the main limitations of existing models: diversity and generalizability. We propose to focus on body-part-level motion features to enable fine-grained text-motion retrieval and to locate suitable references from the database for generation. The motion-language generative model is then trained with prompt-based question-and-answer tasks designed for different motion-related problems. We incorporate the retrieved samples into the prompt and perform instruction tuning of the motion-language model, so that it learns from task feedback and produces promising results with the help of fine-grained multi-modal retrieval. Extensive experiments validate the efficacy of ReMoGPT, showcasing its superiority over existing state-of-the-art methods. The framework performs well on multiple motion tasks, including motion retrieval, generation, and captioning.

## Introduction

In recent years, there has been notable advancement in the development of pre-trained large language models (LLMs), e.g., GPT (Radford and Narasimhan 2018; Radford et al. 2019; Brown et al. 2020; Ouyang et al. 2022), BERT (Devlin et al. 2019), T5 (Raffel et al. 2020; Chung et al. 2022), and Llama (Touvron et al. 2023a,b).
These innovations have enhanced the integration of language (Zhang et al. 2022; Touvron et al. 2023a), images (Radford et al. 2021; Wang et al. 2023; Li et al. 2022; Liu et al. 2023), 3D models (Youwang, Ji-Yeon, and Oh 2022; Mohammad Khalid et al. 2022; Cao et al. 2023), and multi-modal modeling including audio (Girdhar et al. 2023; Shukor et al. 2023), leading to impressive performance in various domains. Despite these improvements in LLMs, building a pre-trained model specifically for human motion and language is still in progress. Such a motion-language model would be able to solve various motion-related tasks through prompts, potentially benefiting diverse fields, including gaming, robotics, virtual assistants, and human behavior analysis.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

[Figure 1: ReMoGPT achieves state-of-the-art performance in text-to-motion generation (R-Top3, MModality, and Diversity vs. ReMoDiffuse and MotionGPT) and motion-to-text captioning (BERTScore, CIDEr, and ROUGE vs. TM2T and MotionGPT).]

Prior research on human motion has delved into diverse motion-related tasks, including motion generation (Guo et al. 2022a; Tevet et al. 2023), motion captioning (Goutsu and Inamura 2021; Guo et al. 2022b), and motion prediction (Yuan and Kitani 2020; Zhang, Black, and Tang 2021). As these works solely focus on the individual tasks they were designed to solve, the resulting models cannot easily be exported to other motion-language tasks, despite their notable accomplishments. MotionGPT (Jiang et al. 2023) was proposed to solve all these individual tasks simultaneously, by converting motion clips into motion tokens and learning to generate motion tokens and text through fine-tuning of pre-trained language models (Raffel et al. 2020; Chung et al. 2022).
However, this versatility comes at a cost, as the method shows limited performance when confronted with unconventional or infrequent text inputs. In the context of motion generation, ReMoDiffuse (Zhang et al. 2023b) attempts to introduce a retrieval-augmentation pipeline, a common technique for enhancing LLMs (Lewis et al. 2020), to address this limitation. However, motion diffusion models are sensitive to the scale of classifier-free guidance and cannot generate motion captions, limiting the range of applications.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

[Figure 2: The comparison between samples obtained from text-to-text retrieval (T2T Retrieval by CLIP) and text-to-motion retrieval (T2M Retrieval by PL-TMR).]

Additionally, ReMoDiffuse relies solely on the text-to-text similarity between captions computed with CLIP (Radford et al. 2021) for the retrieval, making it difficult to retrieve the correct motions. This difficulty is caused by the diversity of motion data, where similar motions can be described by different captions (and vice versa); e.g., "a person is walking forward briskly." and "the figure plants four steps leading with its left foot, with a fifth step not planted." both describe a walking sequence. This limits the performance of retrieval-augmented generation.

In this paper, we introduce ReMoGPT, a generalized motion-language generative model enhanced with retrieval augmentation, specifically designed to overcome the challenges outlined above. ReMoGPT is based on natural language models and uses motion tokens as the representation of motion clips.
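To make the motion-token representation concrete, the sketch below (our own illustration, not the paper's implementation: the actual tokenizer uses a learned VQ-VAE encoder, and the function names and pooling scheme here are hypothetical) shows the core discretization step, mapping pooled motion segments to indices of their nearest codebook entries.

```python
import numpy as np

def quantize_motion(frames, codebook, downsample=4):
    """Toy motion-tokenizer sketch: pool frames into segments and map each
    segment to the index of its nearest codebook entry.

    frames:   (M, d) per-frame motion features
    codebook: (K, d) learned code vectors
    returns:  list of L = M // downsample token indices
    """
    M, d = frames.shape
    L = M // downsample
    # Average-pool each window of `downsample` frames into one latent vector
    latents = frames[: L * downsample].reshape(L, downsample, d).mean(axis=1)
    # Nearest-neighbour lookup in the codebook (squared Euclidean distance)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).tolist()
```

The resulting integer sequence is what a language model can consume as "motion words" alongside ordinary text tokens.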
We propose a fine-grained multi-modal retrieval technique that considers body-part-level motion features to select appropriate references from a database and generate diverse, high-quality motion clips, achieving state-of-the-art performance in motion-related tasks, as shown in Fig. 1. As Fig. 2 demonstrates, the proposed cross-modal part-level text-motion retrieval (PL-TMR) can achieve better results than text-to-text retrieval by CLIP in some cases. The retrieved samples are included in the prompts for effective guidance in motion generation and motion captioning, without hindering the performance of either task or limiting the range of applications.

We evaluate the efficacy of ReMoGPT on two standard motion generation benchmarks, namely HumanML3D (Guo et al. 2022a) and Motion-X (Lin et al. 2023). Extensive quantitative results demonstrate that ReMoGPT outperforms other existing motion-language models with the help of knowledge from the external database. With the use of multi-modal retrieval-augmented generation, ReMoGPT significantly enhances the generation quality for rare samples, demonstrating the generalizability of the proposed method.

We outline our contributions as follows:
- We propose ReMoGPT, a novel unified motion-language generative model that is augmented with multi-modal retrieval between motion and text, to solve various motion-related tasks.
- To enable efficient multi-modal retrieval-augmented generation, we propose a novel body-part-level text-motion retrieval (PL-TMR) model that captures fine-grained motion features.
- We evaluate ReMoGPT on several benchmarks and demonstrate the state-of-the-art performance of the proposed model across various tasks, including motion retrieval, generation, and captioning.

## Related Works

**Human Motion Generation.** The task of generating motions deals with creating diverse and lifelike human movements from various inputs such as text (Ghosh et al. 2021; Guo et al. 2022a; Jiang et al.
2023), action labels (Petrovich, Black, and Varol 2021; Guo et al. 2020; Xin et al. 2023), and incomplete motion (Yuan and Kitani 2020; Zhang, Black, and Tang 2021; Ma et al. 2022; Tevet et al. 2023). Text-to-motion generation has recently garnered attention, as language is an intuitive interface for many users. MDM (Tevet et al. 2023) and MotionDiffuse (Zhang et al. 2024) generate motion using diffusion-based generative models (Ho, Jain, and Abbeel 2020), which are also trained separately for individual motion tasks. T2M-GPT (Zhang et al. 2023a) explores a generative framework that uses a Vector Quantized Variational Autoencoder (VQ-VAE) (Oord, Vinyals et al. 2017) to quantize motion clips into discrete tokens and trains a Generative Pre-trained Transformer (GPT) for motion generation. MotionGPT (Jiang et al. 2023) treats human motion as a foreign language, including the motion tokens converted from motion clips in the prompts of language models to perform pre-training and instruction tuning. The method is able to solve various motion-related tasks simultaneously, including motion generation and motion captioning. However, when the text inputs are drawn from unconventional or infrequent conditions, the performance of the generation model degrades significantly.

**Human Motion Captioning.** Describing human motion in natural language involves learning the correlation between motion and language, as demonstrated by (Takano and Nakamura 2015), which utilizes two statistical models. Recurrent networks have also been employed for this purpose (Yamada, Matsunaga, and Ogata 2018; Plappert, Mandery, and Asfour 2018). More recently, TM2T (Guo et al. 2022b) introduced a novel motion representation that condenses motions into a concise sequence of discrete variables, and then employs a neural machine translator (NMT) to establish mappings between the two modalities. However, the aforementioned MotionGPT (Jiang et al.
2023) is also able to solve this task, and performs better than TM2T (Guo et al. 2022b).

**Text-Motion Retrieval.** In recent years, vision-language foundation models have received substantial attention (Radford et al. 2021), propelled by the availability of large collections of image-text pairs gathered from the internet. These models have inspired work in motion analysis. TMR (Petrovich, Black, and Varol 2023) employs contrastive training during motion generation to align text features with motion features. Motion Patches (Yu, Tanaka, and Fujiwara 2024) proposes an image representation of motion sequences and uses pre-trained image models to extract motion features. Both methods map motion and language into the same feature space, enabling multi-modal retrieval.

**Retrieval-Augmented Generation.** Using retrieval to augment the generation process is a common technique for LLMs (Zhang et al. 2022; Touvron et al. 2023a) in natural language processing (Lewis et al. 2020). Recently, attempts to integrate retrieval-augmented generation into other domains have increased significantly. For instance, retrieval-augmented diffusion models (Blattmann et al. 2022) use user-assigned images instead of retrieved examples, enabling the effective transfer of artistic style from these images to the generated output. In the motion domain, ReMoDiffuse (Zhang et al. 2023b) improves MotionDiffuse (Zhang et al. 2024) with a retrieval-augmentation pipeline that searches for samples according to the text-to-text similarity between captions. However, due to the diversity of motion and language, it is difficult to select appropriate motion samples with text-to-text retrieval alone. Moreover, ReMoDiffuse is only applicable to motion generation. In this paper, we propose ReMoGPT, in which fine-grained part-level multi-modal retrieval between motions and captions is incorporated into a unified motion-language generative model, to efficiently solve various motion-related tasks with one model.
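As a concrete illustration of part-level cross-modal retrieval, the following sketch scores each database motion by comparing its per-body-part embeddings against a query caption embedding. This is our own minimal illustration under stated assumptions, not the PL-TMR implementation: the real model learns the embeddings contrastively, and the shapes, mean-pooled scoring, and function name here are hypothetical.

```python
import numpy as np

def retrieve_topk(text_emb, motion_part_embs, k=3):
    """Retrieve the k motions whose part-level features best match the text.

    text_emb:         (d,)      embedding of the query caption
    motion_part_embs: (N, P, d) embeddings for N motions with P body parts each
    returns:          indices of the top-k motions
    """
    # L2-normalize so that dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb)
    m = motion_part_embs / np.linalg.norm(motion_part_embs, axis=-1, keepdims=True)
    # Per-part similarity to the text, then average over body parts
    part_sims = m @ t                # (N, P)
    scores = part_sims.mean(axis=1)  # (N,)
    return np.argsort(-scores)[:k]
```

Scoring at the part level is what lets a query such as "hops on the left foot" match a motion whose leg features agree with the text even when the whole-body embedding does not.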
## Preliminaries

**Motion Tokenization.** A motion with $M$ frames, denoted as $m_{1:M} = \{m_i\}_{i=1}^{M}$, is first tokenized through a motion encoder $E$ and a motion decoder $D$. $L$ motion tokens $z_{1:L} = \{z_i\}_{i=1}^{L}$, where $L = M/l$ and $l$ is the downsampling rate, are obtained from $E$, and these tokens can be decoded back into the motion as $\hat{m}_{1:M} = D(z_{1:L}) = D(E(m_{1:M}))$, based on the VQ-VAE (Oord, Vinyals et al. 2017) architecture.

**Motion-Language Models.** Because a motion can also be tokenized, motion and language can be learned concurrently by merging the original text vocabulary $V_t = \{v_t^i\}_{i=1}^{K_t}$ with the motion vocabulary $V_m = \{v_m^i\}_{i=1}^{K_m}$, denoted as $V = \{V_t, V_m\}$, where $V_m$ maintains the order of the motion codebook $Z$. To solve the tasks of motion generation and captioning, (Jiang et al. 2023) proposed the use of the transformer-based model T5 (Raffel et al. 2020) as the motion-language model. The input of the model is a sequence of tokens $X_{in} = \{x_{in}^i\}_{i=1}^{L_{in}}$, where $x_{in} \in V$ and $L_{in}$ denotes the input length. In the same manner, the output of the model is $X_{out} = \{x_{out}^i\}_{i=1}^{L_{out}}$, where $x_{out} \in V$ and $L_{out}$ represents the output length. The input tokens are processed by the transformer encoder, and the probability distribution of the next token at each step is modeled autoregressively as $p_\theta(X_{out} \mid X_{in}) = \prod_i p_\theta(x_{out}^i \mid x_{out}^{<i}, X_{in})$.

[Figure 5: Samples of the prompts used for instruction tuning in ReMoGPT, for text-to-motion generation (### Input / ### Output) and motion-to-text captioning (### Input / ### Context / ### Output). Highlighted markers denote the motion tokens paired with the caption and the motion tokens of multi-modal retrieved motion-caption pairs.]

With the results of the multi-modal retrieval, we simply perform instruction tuning by including the retrieved samples in the prompt. Following (Jiang et al.
2023), several instruction prompts are designed for motion generation and motion captioning, as shown in Fig. 5. For instance, an instruction prompt for the motion generation task could be "Show me a motion that illustrates ⟨caption⟩," and an instruction prompt for the motion captioning task could be "Describe the motion illustrated in ⟨motion tokens⟩," where ⟨caption⟩ denotes the caption and ⟨motion tokens⟩ denotes a sequence of motion tokens generated by the motion tokenizer. Different from the existing method, we include the retrieval results as context in the prompt to provide more informative features for the generation. Therefore, during the training phase, the goal is to maximize the log-likelihood of the data distribution:

$$\mathcal{L} = \sum_{i=0}^{L_{out}-1} \log p_\theta\left(x_{out}^i \mid x_{out}^{<i}, X_{in}\right).$$
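The assembly of a retrieval-augmented prompt can be sketched as follows. The "### Input / ### Context / ### Output" section markers follow the prompt samples shown in Fig. 5, while the function name, task flags, and instruction wording are our own illustrative assumptions rather than the paper's exact templates.

```python
def build_prompt(task, query, retrieved_pairs):
    """Assemble an instruction-tuning prompt with retrieved context.

    task:            "t2m" (text-to-motion) or anything else for captioning
    query:           the caption (t2m) or motion-token string (captioning)
    retrieved_pairs: list of (caption, motion_token_string) tuples returned
                     by the multi-modal retriever
    """
    # Retrieved motion-caption pairs become the context section
    context = "\n".join(f"Text: {c}\nMotion: {m}" for c, m in retrieved_pairs)
    if task == "t2m":
        instruction = f"Generate motion: {query}"
    else:
        instruction = f"Generate text: {query}"
    return f"### Input: {instruction}\n### Context:\n{context}\n### Output:"
```

During training, the target sequence (ground-truth motion tokens or caption) is appended after "### Output:", and the model maximizes the log-likelihood of those target tokens given everything before them.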