# UniMuMo: Unified Text, Music and Motion Generation

Han Yang¹, Kun Su², Yutong Zhang³, Jiaben Chen⁴, Kaizhi Qian⁵, Gaowen Liu⁶, Chuang Gan⁴

¹The Chinese University of Hong Kong, ²University of Washington, ³The University of British Columbia, ⁴University of Massachusetts Amherst, ⁵MIT-IBM Watson AI Lab, ⁶Cisco Research

We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns, leveraging existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities.

Code: https://github.com/hanyangclarence/UniMuMo
Website: https://hanyangclarence.github.io/unimumo_demo/
Extended version: https://arxiv.org/abs/2410.04534

## Introduction

Music and body movements are synchronized and inseparable.
The beat and metrical structures in rhythm encourage the spontaneous coordination of body motion with music (Large 2000), activating the motor-related areas of the human brain (Keller and Rieger 2009). Dance particularly exemplifies this connection through choreography that aligns with the music's rhythm, melody, and emotion. Meanwhile, even though most people are not professional musicians or dancers, they often interpret music and dance using simple, natural language. This descriptive text serves as a vital bridge between understandable ideas and abstract concepts in music and motion.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The synergy between music, motion, and text provides a natural motivation to create a model capable of understanding and creating content across all these modalities. Moreover, building a framework that can flexibly generate music, motion, and text in arbitrary combinations is crucial for real-world applications, even though existing models already achieve impressive results in unidirectional generation tasks such as text-to-music (Copet et al. 2023), music-to-motion (Tseng, Castellon, and Liu 2023), motion-to-music (Zhu et al. 2022a), and motion-to-text (Jiang et al. 2023). In the real world, there is a demand for diverse generative abilities, and more complex generation tasks may be necessary, such as creating dance sequences based on both music and textual descriptions. Training individual models for each unique combination, although potentially yielding better output quality, would significantly increase training costs, deployment efforts, and storage requirements. Thus, a unified model that supports all combinations of conditioning and generation tasks, rather than a collection of separate models or adapters grafted onto individual models, offers a more cost-effective solution.
To this end, we introduce a novel task of dynamically generating music, motion, and text in a multitude of combinations within a single unified model. As demonstrated in Fig. 1, this task is designed to handle diverse generative scenarios, ranging from text-to-music and text-to-motion to more complex combinations like text-to-music-plus-motion or music-plus-text-to-motion. However, the task is challenging in two respects: i) the lack of comprehensive datasets that include all three modalities (music, motion, and text) limits the development of a general and unified model. While there are individual datasets for music only (Santana et al. 2020), motion only (Mahmood et al. 2019), music-to-motion (Li et al. 2021b), and text-to-motion (Guo et al. 2022a), a holistic, large-scale dataset that encompasses all three modalities remains absent; ii) designing a unified architecture that supports both the conditioning and generation of all three modalities is difficult, mainly due to the significant differences between the neural representations of the three modalities and the multiplicity of desired generation tasks.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Figure 1: UniMuMo is able to perform generation tasks on any combination of music, motion, and text. The tasks shown in the figure include text-to-aligned-music-motion, music-to-motion, motion-to-music, music captioning, and motion captioning.

To address the first challenge, the lack of paired data, we propose to align unpaired music and motion sequences based on their rhythmic patterns. Specifically, we extract both music beats and motion visual beats, then employ dynamic time warping to find the alignment and warp the motion sequence so that the motion visual beats match the music beats. We find that such augmentation is accurate and efficient.
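The beat-alignment step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes beats are given as frame indices, uses a textbook dynamic-time-warping recurrence over the two beat lists, and warps the motion by linear interpolation along the resulting warping curve.

```python
# Sketch of music-motion alignment: DTW over beat positions, then motion warping.
# Beat representation and cost function are simplifying assumptions.
import numpy as np

def dtw_path(a, b):
    """Classic DTW over two 1-D sequences; returns the warping path as index pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                      # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_motion(motion, vis_beats, music_beats):
    """Warp motion frames so visual beat times land on music beat times."""
    path = dtw_path(vis_beats, music_beats)
    src = np.array([vis_beats[i] for i, _ in path], dtype=float)
    dst = np.array([music_beats[j] for _, j in path], dtype=float)
    t_new = np.arange(motion.shape[0], dtype=float)
    t_src = np.interp(t_new, dst, src)         # map output frame time -> source time
    lo = np.clip(np.floor(t_src).astype(int), 0, motion.shape[0] - 1)
    hi = np.clip(lo + 1, 0, motion.shape[0] - 1)
    w = (t_src - lo)[:, None, None]
    return (1 - w) * motion[lo] + w * motion[hi]  # linear interpolation

motion = np.random.default_rng(1).standard_normal((100, 22, 3))  # toy (T, J, 3) motion
warped = warp_motion(motion, vis_beats=[10, 40, 70, 95], music_beats=[12, 38, 72, 96])
print(warped.shape)  # (100, 22, 3)
```

Note that, as in the paper, the motion (not the music) is warped: the music timeline is kept fixed and each motion frame is resampled from the source timeline.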
With the augmented synchronized music-motion data, we can utilize existing music and motion datasets to train our unified generative model. Additionally, we construct text descriptions from music and motion metadata using a mixture of template filling, large-language-model generation, and music-based language-model generation, striking a balance between diversity, language fluency, and description accuracy.

To overcome the second challenge, we propose a novel framework, UniMuMo, to unify the generation of the different modalities. Our pipeline consists of three main stages: a music-motion joint tokenizer that encodes music and motion sequences into discrete representations within the same space, a music-motion transformer-decoder model trained on the task of music-motion joint generation, and a music-motion captioner that generates text descriptions from music and motion features. In the first stage, we bridge the modality gap between music and motion by mapping motion into the music feature space. Specifically, instead of using separate Vector-Quantized Variational Autoencoders (VQ-VAEs) to quantize music and motion sequences, we encode motion with the codebook of a pre-trained music VQ-VAE, namely Encodec (Défossez et al. 2022). This design facilitates the unification of music and motion within the same generative framework in the subsequent stage. In the second stage, we train a unified music and motion generative model on a novel task of music-motion joint generation from text conditions. To enable the mutual conditioning of music and motion, and to unlock music-to-motion and motion-to-music generation capabilities, we introduce a novel music-motion parallel generation scheme, in which two mutually conditioned streams of autoregressive generation produce aligned music and motion simultaneously.
With the reuse of Encodec and the joint encoding of motion in the previous stage, this stage can be effectively achieved by fine-tuning the pre-trained text-to-music model associated with Encodec, namely MusicGen (Copet et al. 2023), equipping it with additional motion conditioning and generation capabilities while maintaining its music generation capabilities. In the third stage, we fine-tune a T5 decoder for music and motion captioning tasks, using the features extracted by the music-motion decoder trained in stage 2. To transform the decoder into an effective feature extractor, we replace its causal self-attention layers with trainable full self-attention layers and fine-tune them together with the T5 decoder on music and motion captioning tasks.

Extensive experiments demonstrate that UniMuMo achieves competitive performance across all unidirectional generation tasks in music, motion, and text when compared with existing state-of-the-art models, demonstrating the effectiveness and versatility of our approach. Our work offers significant advancements in multimodal generative research, summarized as follows:

- To the best of our knowledge, this is the first unified framework capable of arbitrarily generating content across music, motion, and text.
- To address the shortage of paired multimodal data, we augment and enrich existing large-scale datasets with music-motion data alignment and text augmentations.
- We propose a novel joint codebook for encoding music and motion sequences, along with a music-motion parallel generation scheme, facilitating multiple generation tasks within a single architecture.
- Our framework achieves results comparable to SOTAs across all generation tasks in music, motion, and text.

## Related Work

**Text to Music.** Text-conditioned music generation has been widely studied in recent years. There are two main branches: diffusion-based and transformer-based.
For diffusion-based models, Riffusion (Forsgren and Martiros 2022) uses a latent text-to-image diffusion model to generate spectrograms, which are then converted into audio clips; Moûsai (Schneider, Jin, and Schölkopf 2023) trains a diffusion model in the latent space of a diffusion autoencoder; Noise2Music (Huang et al. 2023a) introduces a cascade of diffusion models that first generates the audio in a coarse form and then progressively refines it. AudioLDM (Liu et al. 2023a) trains a latent diffusion model using CLAP (Wu et al. 2023) embeddings, a language-audio joint representation, for text conditioning. For transformer-based models, MusicLM (Agostinelli et al. 2023) encodes music into high-level semantic tokens and low-level acoustic tokens, and uses a cascade of transformer decoders to generate the two levels stage by stage. MusicGen (Copet et al. 2023) leverages a single-stage transformer decoder to model the hierarchical music tokens directly.

**Music to Text.** Several models have been proposed for audio captioning. WAC (Kadlčík et al. 2023) transfers a pre-trained speech-to-text Whisper model to the music captioning task. LTU (Gong et al. 2023) takes concatenated music embeddings and text embeddings as input to a large language model and directly trains caption generation with language modeling objectives. LP-MusicCaps (Doh et al. 2023) uses a transformer encoder-decoder structure, where the music spectrogram is first encoded by the encoder and then cross-attended by the decoder for text generation. MU-LLaMA (Liu et al. 2023b) leverages a frozen LLaMA (Touvron et al. 2023) and fine-tunes a Music Understanding Adapter to fuse music features into the LLaMA model.

**Music to Motion.** Most works on music-conditioned dance generation are based on transformers. Several approaches (Li et al. 2021a; Fan et al.
2022; Pu and Shan 2022) adopt similar structures that first use a music transformer encoder and a motion transformer encoder to encode the music and the initial motion into separate representations, and then employ a transformer decoder for cross-modal fusion and motion generation. Bailando (Siyao et al. 2022) trains a transformer on motion features encoded by a choreographic memory module, which is the codebook of a motion VQ-VAE. Besides autoregressive transformers, EDGE (Tseng, Castellon, and Liu 2023) adopts a transformer-based diffusion model capable of both dance generation and editing.

**Motion to Music.** Most relevant works focus on generating corresponding music from video input. Foley Music (Gan et al. 2020) focuses on generating music for videos of people playing instruments and uses the Musical Instrument Digital Interface (MIDI) to bridge the gap between body keypoints and the final music. Similarly, RhythmicNet (Su, Liu, and Shlizerman 2021) extends the scenario to arbitrary motion videos by first estimating visual rhythm and then conditionally generating drum and piano music. Dance2Music (Aggarwal and Parikh 2021) encodes a dance similarity matrix with a CNN and predicts the next note autoregressively with an LSTM. CDCD (Zhu et al. 2022b) proposes a single-stage method that uses a discrete latent diffusion model to generate music spectrograms conditioned on video features. D2MGAN (Zhu et al. 2022a) proposes a GAN-based model that generates music tokens from video and pose features.

## Text-Music-Motion Aligned Data Generation

To model arbitrary generation across music, motion, and text, we propose to expand existing music and motion datasets by aligning motion with music and synthesizing textual descriptions. The data generation pipeline includes four major steps: 1) music beat detection, 2) visual beat detection, 3) music-motion alignment, and 4) text description synthesis.

**Music Beat Detection.**
We estimate music beats from a music waveform $Y \in \mathbb{R}^{T_w}$, where $T_w$ is the number of samples, using a Bidirectional-LSTM-based model from (Chiu, Su, and Yang 2021). This model performs beat tracking on extracted drum and non-drum features separately, then aggregates the results with a learnable fuser. We manually evaluate the accuracy of this beat tracking model and find that it performs well in most test cases, outperforming the beat tracking methods in the Librosa API (McFee et al. 2015). The resulting music beats are represented as a binary sequence $B_m \in \{0, 1\}^{T_w}$, where each frame is marked as beat or non-beat.

**Visual Beat Detection.** Given a 3D motion sequence $M \in \mathbb{R}^{T_m \times J \times 3}$, where $T_m$ is the number of frames, $J$ the number of joints, and the last dimension the x, y, z coordinates, we obtain visual beats in three steps. In the first stage, we compute the motion directogram (Davis and Agrawala 2018), a 2D matrix that factors motion into different motion angles, similar to how an audio spectrogram factors sound amplitude into different frequencies. Specifically, we first compute the first-order difference of the motion sequence, $\Delta M_t = M_t - M_{t-1}$. Based on its motion angle, we assign the motion magnitude of every joint to one of the angular bins of width $2\pi/N_{bins}$. The motion directogram $M_d(t, \theta)$ is obtained by summing the motion magnitudes within each bin:

$$M_d(t, \theta) = \sum_j \lVert \Delta M_t(j) \rVert \, \mathbb{1}_\theta(\angle \Delta M_t(j)), \quad \text{where } \mathbb{1}_\theta(\phi) = \begin{cases} 1 & \text{if } |\theta - \phi| \le 2\pi/N_{bins} \\ 0 & \text{otherwise.} \end{cases}$$

In the second stage, we convert the motion directogram to the kinematic offset $M_k$, which represents motion changes, similar to the onset envelope of an audio spectrogram. We first obtain the motion flux $M_f$, which captures deceleration in various directions, by computing the negative first-order difference of the directogram $M_d$. We then average each frame of $M_f$ and keep the top 1% of peaks to obtain the kinematic offset $M_k$.
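The directogram and kinematic-offset computation above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the authors' code: in particular, it reduces each 3D joint displacement to a single angle by projecting onto the horizontal plane, and the bin count and shapes are assumptions.

```python
# Sketch of the motion directogram and kinematic offset (simplified assumptions:
# horizontal-plane angles, 12 bins, toy random motion).
import numpy as np

def motion_directogram(motion, n_bins=12):
    """motion: (T, J, 3) joint positions -> directogram of shape (T-1, n_bins)."""
    dm = np.diff(motion, axis=0)                      # first-order difference, (T-1, J, 3)
    mag = np.linalg.norm(dm, axis=-1)                 # motion magnitude per joint
    ang = np.arctan2(dm[..., 1], dm[..., 0])          # motion angle in [-pi, pi)
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    directo = np.zeros((dm.shape[0], n_bins))
    for t in range(dm.shape[0]):
        np.add.at(directo[t], bins[t], mag[t])        # sum magnitudes into angle bins
    return directo

def kinematic_offset(directo):
    """Motion flux = negative first-order difference (deceleration), averaged per frame."""
    flux = np.maximum(-np.diff(directo, axis=0), 0.0)
    return flux.mean(axis=1)

rng = np.random.default_rng(0)
motion = rng.standard_normal((60, 22, 3)).cumsum(axis=0)  # toy (T, J, 3) motion sequence
d = motion_directogram(motion)
k = kinematic_offset(d)
print(d.shape, k.shape)  # (59, 12) (58,)
```

The paper's final step (top-1% peak filtering and dynamic-programming beat selection) would then operate on this per-frame kinematic offset.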
In the last stage, we use dynamic programming to compute the visual beats, designing an objective function that selects strong visual changes from the kinematic offsets and encourages equally spaced beats. More details can be found in the Appendix. The final visual beats are also represented as a binary sequence $B_v \in \{0, 1\}^{T_m}$, where each frame is marked as beat or non-beat.

**Music-Motion Alignment.** We apply dynamic time warping to determine the optimal matching between the music beats $B_m$ and the visual beats $B_v$, finding an alignment even though the durations of the two binary sequences may differ. Finally, we warp the motion sequences by interpolating along the warping curve to obtain aligned music-motion pairs. The reason for warping motion to match music, rather than the reverse, is that music beats tend to be steady, so warping music could produce perceptually unacceptable changes. More details can be found in the Appendix.

**Text Description Synthesis.** To compensate for the absence of text descriptions in the datasets we use, we employ two methods for caption synthesis: (1) using a Music Understanding Language Model to generate captions directly from audio; and (2) using a Large Language Model to synthesize captions from metadata (genre, tempo, etc.), striking a balance between musical accuracy and diversity. Examples and more details are shown in the Appendix.

## UniMuMo Framework

UniMuMo consists of three training stages to enable arbitrary generation between music, motion, and text. In stage 1, we encode aligned music and motion data into discrete tokens. To efficiently bridge the gap between the two modalities, we use a frozen pre-trained audio tokenizer, Encodec (Défossez et al. 2022), and train a motion tokenizer that reuses the residual codebooks of the audio tokenizer. In stage 2, we fine-tune a state-of-the-art text-to-music transformer decoder (Copet et al.
2023) by conducting the task of generating music and motion tokens simultaneously from music and motion text descriptions. At inference time, we can perform parallel generation to unlock music and motion generation applications. In stage 3, we treat the pre-trained music-motion decoder model from stage 2 as a feature extractor and fine-tune a T5 decoder with a language modeling objective for music and motion captioning. An overview of the UniMuMo framework is shown in Figure 2.

### Stage 1. Music and Motion Joint Tokenization

While existing tokenization approaches can faithfully reconstruct music or motion individually, the correlations between the two modalities become intricate in distinct spaces, so directly applying them in a unified generation framework poses challenges. Besides, a music tokenizer usually requires more training resources and time to achieve high-quality reconstruction than a motion tokenizer. Motivated by these facts, we introduce an efficient and effective way to encode music and motion into a joint latent space. We use a pre-trained audio tokenizer, Encodec (Défossez et al. 2022), and train a new motion encoder-decoder. The motion encoder maps the motion into the same embedding space as the music and reuses the frozen music Residual Vector Quantizers (RVQ) to discretize the motion into tokens; from these tokens, the motion decoder reconstructs the motion. Given the higher complexity and richer information in music compared to motion, the learned music codebook is in principle capable of encoding motion.

Specifically, given a waveform $Y \in \mathbb{R}^{T \cdot f_w}$, with $T$ the audio duration and $f_w$ the sample rate, Encodec first encodes it into a continuous tensor $X_{music} \in \mathbb{R}^{d \times T \cdot f_r}$, where $f_r \ll f_w$ is the frame rate of the residual codebook and $d$ is the dimension of the codebook entries. $X_{music}$ is then quantized by the RVQ into music tokens $Q^{music} \in \{1, \ldots, M\}^{K \times T \cdot f_r}$, where $K$ is the number of residual quantizers and $M$ is the number of codebook entries. For an aligned motion sequence of the same duration, $M \in \mathbb{R}^{d_m \times T \cdot f_m}$, with frame rate $f_m$ and feature dimension $d_m$, our motion encoder encodes it into $X_{motion} \in \mathbb{R}^{d \times T \cdot f_r}$, the same shape as $X_{music}$, which is then tokenized by the same RVQ into motion tokens $Q^{motion} \in \{1, \ldots, M\}^{K \times T \cdot f_r}$. The motion decoder decodes the motion feature after the RVQ, yielding the reconstruction $\hat{M}$. The motion encoder-decoder is trained by minimizing the motion reconstruction loss together with a commitment loss $L_{commit}$ from the codebook:

$$L_{total} = \frac{1}{|D|} \sum_{M \in D} \left( \lVert M - \hat{M} \rVert^2 + \lambda L_{commit} \right) \quad (1)$$

where $D$ is the motion dataset and $\lambda$ controls the strength of the commitment loss; empirically, $\lambda$ is set to 0.02.

With this design, the music-motion joint tokenization can effectively learn multimodal correlations by mapping motion features into the same space as music, without the need to train another computationally heavy music autoencoder. Moreover, it enables directly using the text-to-music model associated with Encodec as an initialization for the subsequent music-motion decoder model, significantly reducing training costs and enhancing performance. Experimentally, such feature alignment is crucial to learning the joint generation of music and motion within a single transformer model.

### Stage 2. Music and Motion Generation from Text

In this stage, we modify and fine-tune an existing state-of-the-art text-to-music model with the music and motion tokens extracted in Stage 1, enabling it to handle all tasks related to music and motion generation, such as text-to-music-motion and motion-to-music. In particular, we employ MusicGen (Copet et al. 2023), an open-source, single-stage transformer decoder model that generates multi-level music tokens with a specific codebook interleaving pattern.
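The Stage-1 discretization of motion features with a frozen residual vector quantizer can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: random codebooks stand in for Encodec's pre-trained ones, and the sizes ($K = 4$, $M = 256$, $d = 16$) are illustrative, not the paper's configuration.

```python
# Sketch of residual vector quantization (RVQ) with a frozen codebook stack:
# each level quantizes the residual left by the previous level.
import numpy as np

def rvq_encode(x, codebooks):
    """x: (T, d) continuous features; codebooks: (K, M, d) frozen entries.
    Returns tokens of shape (K, T) and the quantized reconstruction (T, d)."""
    residual = x.copy()
    tokens, quantized = [], np.zeros_like(x)
    for cb in codebooks:                                      # K quantization levels
        d2 = ((residual[:, None, :] - cb[None]) ** 2).sum(-1) # (T, M) squared distances
        idx = d2.argmin(axis=1)                               # nearest entry per frame
        tokens.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]                                   # pass the residual down
    return np.stack(tokens), quantized

rng = np.random.default_rng(2)
K, M, d, T = 4, 256, 16, 50                                   # illustrative sizes
codebooks = rng.standard_normal((K, M, d))                    # stand-in for frozen Encodec RVQ
x = rng.standard_normal((T, d))                               # stand-in for motion-encoder output
tokens, x_hat = rvq_encode(x, codebooks)
print(tokens.shape, x_hat.shape)  # (4, 50) (50, 16)
```

In the paper's setup, the motion encoder is trained (via the reconstruction and commitment losses of Eq. 1) to produce features that this frozen quantizer can represent well; the quantizer itself never updates.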
Following their practice, we apply the delay pattern to both music and motion tokens, utilize a T5 encoder for encoding text descriptions, and adopt cross-attention to incorporate the text conditioning features into the transformer decoder. To enable the autoregressive generation of music and motion within a unified framework, we propose training on the task of music-motion joint generation with a novel parallel generation scheme, in which two streams (i.e., music and motion) of predict-next-token generation are conducted simultaneously, each stream conditioned on the other. Specifically, given the music tokens $Q^{music}$ and motion tokens $Q^{motion}$ with the same shape $K \times S$, where $S = T \cdot f_r$ is the sequence length, we first transform them with the delay pattern (Copet et al. 2023) into $Q'^{music}$ and $Q'^{motion}$ respectively, each of shape $K \times S'$, where $S' = S + K - 1$. We then concatenate them along the time dimension into $Q^{input}$ of shape $K \times 2S'$ as the input to the transformer decoder. The model's output is transformed back to the normal pattern for loss calculation. Training on music-motion joint generation, we adopt the predict-next-token objective for both the music and motion tokens in each forward pass:

$$\mathcal{L} = -\sum_{t=1}^{S} \log P\left( Q^{music}_t \mid Q^{music}_{<t}, Q^{motion}_{<t} \right) - \sum_{t=1}^{S} \log P\left( Q^{motion}_t \mid Q^{motion}_{<t}, Q^{music}_{<t} \right)$$
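The delay interleaving described above can be sketched concretely. This is a minimal NumPy illustration of MusicGen-style delay: codebook level $k$ is shifted right by $k$ steps and padded, turning a $(K, S)$ token grid into $(K, S + K - 1)$. The padding id `PAD` is a placeholder assumption for the model's special token.

```python
# Sketch of the MusicGen "delay" codebook interleaving pattern on a (K, S) token grid.
import numpy as np

PAD = -1  # placeholder for the special/empty token id (an assumption)

def apply_delay(tokens):
    """Shift codebook level k right by k steps: (K, S) -> (K, S + K - 1)."""
    K, S = tokens.shape
    out = np.full((K, S + K - 1), PAD, dtype=tokens.dtype)
    for k in range(K):
        out[k, k:k + S] = tokens[k]
    return out

def undo_delay(delayed, S):
    """Invert the delay pattern, recovering the original (K, S) grid."""
    K = delayed.shape[0]
    return np.stack([delayed[k, k:k + S] for k in range(K)])

q = np.arange(12).reshape(4, 3)   # toy token grid with K=4 levels, S=3 steps
delayed = apply_delay(q)
print(delayed.shape)              # (4, 6)
print(delayed[1])                 # [-1  3  4  5 -1 -1]
```

In the parallel scheme above, the delayed music grid and the delayed motion grid (each $K \times S'$) are then concatenated along the time axis into the $K \times 2S'$ decoder input, and the model's output is un-delayed before the loss is computed.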