# How Does it Sound? Generation of Rhythmic Soundtracks for Human Movement Videos

Kun Su, Xiulong Liu, Eli Shlizerman

These authors contributed equally. Department of Electrical & Computer Engineering, University of Washington, Seattle, USA. Department of Applied Mathematics, University of Washington, Seattle, USA. Corresponding author: shlizee@uw.edu. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Figure 1: RhythmicNet: given an input silent human movement video, RhythmicNet generates a soundtrack for it.

One of the primary purposes of video is to capture people and their unique activities. It is often the case that the experience of watching the video can be enhanced by adding a musical soundtrack that is in sync with the rhythmic features of these activities. How would this soundtrack sound? Such a problem is challenging since little is known about capturing the rhythmic nature of free body movements. In this work, we explore this problem and propose a novel system, called RhythmicNet, which takes as an input a video with human movements and generates a soundtrack for it. RhythmicNet works directly with human movements, by extracting skeleton keypoints and implementing a sequence of models translating them to rhythmic sounds. RhythmicNet follows the natural process of music improvisation, which includes the prescription of streams of the beat, the rhythm and the melody. In particular, RhythmicNet first infers the music beat and the style pattern from body keypoints per each frame to produce the rhythm. Next, it implements a transformer-based model to generate the hits of drum instruments and a U-net based model to generate the velocities and the offsets of the instruments. Additional types of instruments are added to the soundtrack by further conditioning on the generated drum sounds. We evaluate RhythmicNet on large-scale video datasets that include body movements with inherent sound association, such as dance, as well as on in-the-wild internet videos of various movements and actions. We show that the method can generate plausible music that aligns with different types of human movements.

## 1 Introduction

Rhythmic sounds are everywhere, from raindrops falling on surfaces, to birds chirping, to machines generating unique sound patterns. When sounds accompany visual scenes, they enhance the perception of the scene by complementing it with additional cues such as semantic association of events, means of communication, drawing attention to parts of the scene, and many more.

Figure 2: System overview of RhythmicNet. Keypoints are extracted from a human activity video and are processed through the Video2Rhythm stage to generate the rhythm. Afterwards, Rhythm2Drum converts the rhythm to a drum performance. In the last step, the Drum2Music component adds additional instrument tracks on top of the drum track.

For visual scenes that include activity of people, rhythmical music that is in sync with the rhythm of body movements can emphasize the actions of the person and enhance the perception of the activity [1, 2]. Indeed, to support such synchrony, a usual practice is that a musical soundtrack is chosen manually in professionally edited videos. Drum instruments serve as the fundamental part in music by generating the underlying leading rhythm patterns. While drum instruments vary in shape, form, and mechanics, their main purpose is to set the essential rhythm for any music.
Indeed, drums are known to have existed since around 6000 BC, and even before then there were instruments based on the principle of hitting two objects together to generate sounds [3]. On top of drum patterns, additional instruments add secondary patterns and melody, creating rich, multifaceted music. In modern music composition and improvisation, it is also common for composers to start a new musical piece by designing the rhythm of the corresponding drum track. As the piece evolves, additional accompanying instrument tracks are gradually superimposed on top of the drum track to produce the final music.

Inspired by the possibility of associating rhythmic soundtracks with videos, in this work we explore the automatic generation of rhythmic music correlated with human body movements. We follow steps similar to those of music composition and improvisation by first generating the rhythm of the music, which is strongly correlated with the beat and movement patterns. Such a rhythm can then be used to generate novel drum music accompanying the body movements. With the rhythm inferred, we follow further steps of music improvisation and add new instrument tracks (piano and guitar) to enrich the music.

In summary, we address the challenge of generating a rhythmic soundtrack for a human movement video by proposing a novel pipeline named RhythmicNet, which translates human movements from the domain of video to rhythmic music with three sequential components: Video2Rhythm, Rhythm2Drum, and Drum2Music. In the first stage of RhythmicNet, given a human movement video, we extract the keypoints from the video and use a spatio-temporal graph convolutional network [4] in conjunction with a transformer encoder [5] to capture motion features for the estimation of music beats. Since music beats are periodic while various visual changes occur in human movements, we propose an additional stream, called the style, which captures fast movements. The combination of the two streams constitutes the movement rhythm and guides music generation in the next stage, called Rhythm2Drum. This stage includes an encoder-decoder transformer that, given the rhythm, generates the drum performance hits, and a U-net [6] which subsequently generates drum velocities and offsets. We find that these two steps are critical for the generation of quality drum music. In the last stage, called Drum2Music, we complete the drum music by adopting an encoder-decoder architecture based on Transformer-XL [7] to generate a music track of either piano or guitar conditioned on the generated drum performance. An overview of RhythmicNet is shown in Fig. 2.

Our main contributions are: (i) To the best of our knowledge, we are the first to generate a novel musical soundtrack that is in sync with human activities. (ii) We introduce an entire pipeline, named RhythmicNet, which implements three stages to complete the transformation. (iii) RhythmicNet is robust and generalizable. Experiments on large-scale dance video datasets and on in-the-wild internet videos show that music generated by RhythmicNet is consistent with the human body movements in the videos.

## 2 Related Work

Generation of sounds for a video is a challenging problem since it aims to relate two signals that are only indirectly correlated.
It belongs to the class of audio-visual learning problems, which explore and leverage the correlation between audio and video for tasks such as audio-visual correspondence [8, 9, 10, 11], video sound separation [12, 13, 14, 15], audio-visual event localization [16], and transformations of audio to body movements [17, 18, 19], lip movements [20], and talking faces [21, 22, 23]. Audio-visual systems are usually developed using multi-modal learning techniques, which have been shown to be effective in action recognition [24, 25], speech question answering [26, 27, 28, 29, 30], physical simulation of 3D worlds [31], and medical image analysis [32, 33, 34, 35, 36].

Several approaches have been proposed for relating sounds to a video. A deep learning approach showed the potential of such an application by proposing a recurrent neural network that predicts the audio features of impact sounds from videos and produces a waveform from these features [37]. In subsequent work, a conditional generative adversarial network was proposed to achieve cross-modal audio-visual generation of musical performances [38]. In both methods, a single image was used as the input, and the network was supervised on instrument classes to generate a low-resolution spectrogram. Concurrently, for natural sounds, a SampleRNN-based method [39] was introduced to generate sounds such as a baby crying or water flowing, given a visual scene. This approach was enhanced by an audio forwarding regularizer that takes the real sound as an input and outputs bottlenecked sound features, providing stronger supervision for predicting natural sounds from visual features alone [40].

Compared to natural sounds with relatively simple characteristics, music contains more complex elements. While this problem is more challenging, the possibility of correlating movement and sound was shown by a rule-based sensor system that succeeded in converting sensed motion to music notes [41]. In recent years there has been remarkable progress in the generation of music from video. An interactive background music synthesis algorithm guided by visual content was introduced to synthesize dynamic background music for different scenarios [42]. The method, however, relied on reference music retrieval and could not generate new music directly. Direct music generation approaches have been developed for videos capturing a musician playing an instrument. A ResNet-based method was proposed to predict pitch and onset events given frames of top-view videos of pianists playing the piano [43]. Later, Audeo [44] demonstrated the possibility of transcribing video to high-quality music. While the results of such methods are promising, the generation is limited to a single instrument. Foley Music [45] proposed a Graph-Transformer network to generate Midi events from body keypoints and achieved convincing music synthesized from Midi. Further, Multi-Instrumentalist Net [46] showed the generation of music waveforms of different instruments in an unsupervised way. While these approaches demonstrate the possibility of generating music from videos, the videos need to contain strong visual cues, such as instruments, that indicate the type of music being generated. It remains unclear whether it is possible to generate music when such visual cues do not exist. With respect to human movement, this would mean extracting the characteristics of the movement and attempting to match music to them.
In this regard, a novel approach to dance beat tracking was recently proposed [47]. The approach aims at detecting the characteristics of musical beats from a video of a dance using visual information only. Inspired by this work, we design a novel methodology to precisely estimate musical characteristics, such as beats, from movements and to utilize them to improvise new music.

There has also been substantial recent progress in the generation of music from symbolic representations. In particular, the Musical Instrument Digital Interface (Midi) representation has been shown to be useful in modeling and generating music. Initial works converted Midi into a piano-roll representation and used generative adversarial networks [48] or variational autoencoders [49, 50] to generate new music. A limitation of the piano-roll representation is that it becomes memory-inefficient when the music is long. To address this limitation, an event-based representation has been proposed and shown to be a useful and efficient representation for modeling music [51, 52, 53]. While the event-based representation enabled models to obtain convincing generated results, it lacks metrical structure, leading to unsteady beats in the generated samples. Therefore, a new representation called Remi was recently proposed to impose a metrical structure on the input data, so that models can be aware of the beat-bar-phrase hierarchical structure of music [54]. In our work, we utilize the Remi representation by converting the Midi into Remi in the Drum2Music stage.

Figure 3: Detailed schematics of the components in the Video2Rhythm stage.

While the methods mentioned above generate unconditional music, it has also been shown to be possible to constrain music generation. For example, it was proposed to constrain generative models to sample for predefined attributes [55]. Systems such as Jukebox [56] and MuseNet [57] showed the possibility of generating music based on user preferences, which corresponds to a network model specifically trained with labeled tokens as a conditioning input. Furthermore, a Transformer autoencoder has been proposed to aggregate encodings of Midi data across time to obtain a global representation of style from a given performance. Such a global representation can be used to control the style of the music [58]. Additional models have been proposed, such as a model capable of generating kick drums given conditioning signals including the beat, downbeat, and the onsets of snare and bass [59]. In RhythmicNet, conditioning additional music instruments on the drum track is expected to provide a richer soundtrack. For this purpose, in the Drum2Music stage, we utilize the Transformer autoencoder, consider the drum track as the conditioning input, and generate the track of another musical instrument, such as piano or guitar.

## 3 RhythmicNet

RhythmicNet includes three sequential components: 1) association of rhythm with human movements (Video2Rhythm), 2) generation of a drum track from the rhythm (Rhythm2Drum), and 3) adding instruments to the drum track (Drum2Music). We describe the details of each stage below.

Video2Rhythm. We decompose the rhythm into two streams: beats and style. We propose a novel model to predict music beats and a kinematic-offsets-based approach to extract style patterns from human movements.

Music Beats Prediction.
The beat is a binary periodic signal determined by a fixed tempo, and it is obtained by a music beat prediction network, which learns the beat by pairing body keypoints with ground-truth music beats in a supervised way. To predict regular music beats from human body movements, we extract 2D skeleton keypoints via the OpenPose framework [60] and take a first-order difference to obtain the velocity for each video. The motion sequence is represented as a three-dimensional tensor $X \in \mathbb{R}^{V \times T \times 2}$, where $V$ is the number of keypoints, $T$ is the number of frames, and the last dimension holds the 2D coordinates. We formulate the prediction of music beats as a temporal binary classification problem: given the skeleton keypoints $X$, we aim to generate an output of the same length, $Y \in \mathbb{R}^{T}$, where each frame is classified as beat ($y = 1$) or non-beat ($y = 0$).

We encode the keypoints using a spatio-temporal graph convolutional neural network (ST-GCN) [4]. Such an encoding represents the skeleton sequence as an undirected graph $G = (V, E)$, where each node $v_i \in V$ corresponds to a keypoint of the human body and the edges reflect the connectivity of body keypoints. The sequence passes through a spatial GCN to obtain features at each frame independently, and a temporal convolution is then applied to aggregate the temporal cues. The encoded motion features are represented as $P = A X W_S W_T \in \mathbb{R}^{V \times T_v \times C_v}$, where $X$ is the input, $A \in \mathbb{R}^{V \times V}$ is the adjacency matrix of the graph defined by the body keypoint connections, $W_S$ and $W_T$ are the weight matrices of the spatial graph convolution and the temporal convolution, and $T_v$ and $C_v$ denote the temporal dimension and the number of feature channels. We obtain the final motion features $P \in \mathbb{R}^{T_v \times C_v}$ by averaging over the node features.

Given the motion features $P$, we use a transformer encoder that contains a stack of multi-head self-attention layers to learn the correlation between different frames. Due to the periodicity of the music beats, we introduce two components to allow the model to capture them more accurately: 1) We adopt a relative position encoding [61] to allow attention to explicitly resolve the distance between two tokens in a sequence, instead of using the common positional sinusoids to represent timing information. This encoding is critical for modeling timing in music, where relative differences matter more than absolute values [52]. 2) We use the temporal self-similarity matrix (SSM) of the motion features, which has been shown to be effective for human action recognition, for regularizing transformers, and for counting repetitions of periodic movements [62, 63, 64]. The SSM is constructed by computing all pairwise similarities $S_{ij} = f(P_i, P_j)$ between frame-level motion features $P_i$ and $P_j$, where $f(\cdot, \cdot)$ is a similarity function. We use the negative squared Euclidean distance, $f(a, b) = -\|a - b\|^2$, followed by a softmax over the time axis. The SSM has a single channel; it passes through a convolution layer, $\hat{S} = \mathrm{Conv}(S)$, and is then added to every attention head in the self-attention component, implemented as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T} + \hat{S} + R}{\sqrt{D_k}}\right)V,$$

where $Q$, $K$, and $V$ are the standard query, key, and value, respectively, and $R$ is the ordered relative position encoding for each possible pairwise distance between query and key positions on each head. We train the model using a weighted binary cross-entropy loss that puts more weight on the beat category to address the class imbalance.
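
As an illustration of the SSM-biased attention described above, the following PyTorch sketch (a minimal single-head example, not the authors' implementation; the relative position term $R$ and the multi-head structure are omitted, and all names are ours) computes the self-similarity matrix from frame-level motion features with the negative squared Euclidean distance followed by a softmax over time, passes it through a single-channel convolution, and adds the result to the attention logits before scaling.

```python
import torch
import torch.nn.functional as F

def self_similarity_matrix(p):
    """p: (T, C) frame-level motion features. Returns the (T, T) SSM:
    negative squared Euclidean distances, softmax-normalized over the time axis."""
    dist = torch.cdist(p, p) ** 2            # (T, T) pairwise squared distances
    return F.softmax(-dist, dim=-1)

def ssm_biased_attention(q, k, v, s_hat):
    """Single-head attention with an additive SSM bias on the logits.
    q, k, v: (T, D); s_hat: (T, T), the SSM after a learned convolution.
    The relative position term R from the paper is omitted here."""
    d_k = q.size(-1)
    logits = (q @ k.transpose(-2, -1) + s_hat) / d_k ** 0.5
    return F.softmax(logits, dim=-1) @ v

# Toy usage with random motion features (T frames, C feature channels).
T, C, D = 64, 128, 32
p = torch.randn(T, C)
ssm = self_similarity_matrix(p)                        # (T, T), single channel
conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
s_hat = conv(ssm[None, None])[0, 0]                    # S_hat = Conv(S)
q, k, v = (torch.randn(T, D) for _ in range(3))
out = ssm_biased_attention(q, k, v, s_hat)             # (T, D)
```
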
In comparison with previous work [47], the combination of the graph representation, relative self-attention, and SSM components enables the model to better capture the spatio-temporal structure of body dynamics, which allows for more accurate beat estimation. The output of the network is the beat activation function; i.e., for each video frame, the model predicts its probability of being a beat frame. To obtain beat positions, we apply an algorithm based on the HMM decoding proposed in [65].

Style Extraction. While beats represent the monotonic periodic pattern occurring at fixed time intervals (i.e., a periodic signal), there are additional aperiodic components in the rhythm. In particular, between two music beats there are typically various irregular movements that contribute to the rhythm. In contrast to beats, these patterns are inconsistent, and it is unclear how to systematically extract them from visual information. We therefore define an additional stream, called style, which records incidences of transitional movements of the human body, such as rapid and sudden movements. For the prediction of such events we apply a rule-based approach, since the definition of style is implicit and there is no data from which to learn a mapping from body keypoints to transitional movements. The style is defined as a binary stream that marks transitional time points as 1 and non-transitional time points as 0. We compose the style stream in several steps based on spectral analysis of the kinematic offsets of the motion [66].

The first step is to compute the kinematic offsets, a 1D time series representing the average acceleration of the human body over time. To obtain the kinematic offsets, we calculate the directogram of the motion by factoring it into different angles. Given $F_t(j, t)$, the velocity of joint $j$ at time $t$, we formulate the directogram $D(t, \theta)$ [67] as:

$$D(t, \theta) = \sum_{j} \|F_t(j, t)\| \, \mathbb{1}_{\theta}\big(\angle F_t(j, t)\big), \qquad \text{where} \quad \mathbb{1}_{\theta}(\phi) = \begin{cases} 1 & |\theta - \phi| \leq 2\pi / N_{\mathrm{bins}} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

The indicator function $\mathbb{1}_{\theta}(\phi)$ distributes the motion of all joints into $N_{\mathrm{bins}}$ angular intervals. The first-order difference of the directogram is then calculated to obtain the acceleration of motion across different angles. The mean acceleration in the positive direction measures motion strength (i.e., the larger the value, the stronger the motion) and corresponds to the kinematic offsets.

Once the kinematic offsets are obtained, we perform a Short-Time Fourier Transform (STFT) on them to identify peaks in the change of acceleration. The highest frequency bin of the STFT (out of 8) represents the most pronounced transitions in the signal, and we use it to extract the style patterns from the motion. The peaks are defined as the top 10% of magnitudes over the duration of the video. We mark the time points of the peaks as 1 and all other time points as 0. Since the STFT has low temporal resolution (due to a hop size of 4, chosen for efficient computation), we upsample the binary signal by the hop size to obtain a binary signal that matches the resolution of the video. The output signal is then re-sampled to the same sampling rate as the music beats.

Rhythm Composition. We obtain the rhythm by adding the beat and style streams into a single signal. The rhythm should capture the correlation of the body movements with the tempo of the soundtrack.

Figure 4: Detailed schematics of the components in the Rhythm2Drum stage.
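
Since the style extraction is rule based, it can be sketched directly. The following NumPy/SciPy sketch is illustrative rather than the authors' code: it uses hard angular binning instead of the overlapping intervals of Eq. (1), and the parameter values (number of bins, FFT size) are assumptions. It computes a directogram from per-joint 2D velocities, derives the kinematic offsets as the mean positive acceleration over the angular bins, keeps the highest STFT frequency bin, thresholds the top 10% of magnitudes, and upsamples the resulting binary signal back to the video frame rate.

```python
import numpy as np
from scipy.signal import stft

def directogram(velocity, n_bins=12):
    """velocity: (T, J, 2) per-joint 2D velocities.
    Bins each joint's speed by its motion direction, per frame -> (T, n_bins)."""
    speed = np.linalg.norm(velocity, axis=-1)                # (T, J)
    angle = np.arctan2(velocity[..., 1], velocity[..., 0])   # (T, J), in [-pi, pi)
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    D = np.zeros((velocity.shape[0], n_bins))
    for b in range(n_bins):
        mask = (angle >= edges[b]) & (angle < edges[b + 1])
        D[:, b] = (speed * mask).sum(axis=1)
    return D

def style_stream(velocity, hop=4, n_fft=14, top_frac=0.10):
    """Binary style stream: 1 at frames with strong movement transitions, else 0."""
    D = directogram(velocity)
    # Kinematic offsets: mean positive acceleration across the angular bins.
    accel = np.diff(D, axis=0, prepend=D[:1])
    offsets = np.clip(accel, 0, None).mean(axis=1)           # (T,)
    # STFT of the offsets; keep only the highest-frequency bin.
    _, _, Z = stft(offsets, nperseg=n_fft, noverlap=n_fft - hop)
    high = np.abs(Z[-1])                                     # highest-frequency magnitudes
    peaks = (high >= np.quantile(high, 1.0 - top_frac)).astype(np.float32)
    # Upsample by the hop size back to (roughly) video frame resolution.
    return np.repeat(peaks, hop)[: len(offsets)]

# The rhythm combines the beat and style streams into one binary signal,
# e.g. a clipped sum / element-wise OR: rhythm = np.clip(beats + style, 0, 1)
```
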
Figure 5: Detailed schematics of the components in the Drum2Music stage.

Rhythm2Drum. The Rhythm2Drum stage interprets the rhythm provided by the previous stage into drum sounds. In this stage we follow the GrooVAE setup [50], where each drum track is represented by three matrices: hits, velocities, and offsets. The hits represent the presence of drum onsets and form a binary matrix $H \in \mathbb{R}^{N \times T}$, where $N$ is the number of drum instruments and $T$ is the number of time steps (one per 16th note). The velocities form a continuous matrix $V$ that reflects how hard the drums are struck, with values in the range $[0, 1]$. The offsets $O$ are also a continuous matrix and store the timing offsets, with values in the range $[-0.5, 0.5)$; these values indicate how far, and in which direction, each note's timing lies relative to the nearest 16th note. The matrices $V$, $O$, and $H$ have the same shape. Given the input rhythm sequence $Y \in \mathbb{R}^{1 \times T}$, we aim to generate $H$, $V$, and $O$. In contrast to GrooVAE [50], which models all three matrices simultaneously with multiple losses, we model $H$, $V$, and $O$ in two steps, using a combination of an encoder-decoder transformer [5] and a U-net [6]. In the first step, the binary rhythm is passed as an input to the transformer encoder. In the decoder, the $H$ matrix is converted into a word sequence defined by a small vocabulary of all possible combinations of hits, and is mapped back to a binary matrix for the final output. We observe that by autoregressively learning the hits $H$ as a word sequence, the transformer generates more natural and diverse drum onsets. We train the transformer with a cross-entropy loss. In the second step, we add style patterns (velocities and offsets) to the onsets. Since $H$ has the same shape as $V$ and $O$, we can regard this step as a transformation between two images of the same shape. To achieve this transformation, we adopt a U-net [6] that takes the onset matrix $H$ as an input and generates $V$ and $O$. We use a Mean Squared Error (MSE) loss for the U-net optimization. Finally, we convert the generated matrices $H$, $V$, and $O$ to the Midi representation to produce the drum track.

Drum2Music. In this last stage we add further instruments to enrich the soundtrack. Since the drum track contains rhythmic music, we propose to condition the additional instrument stream on the generated drum track. Specifically, we propose an encoder-decoder architecture, such that the encoder receives the drum track as an input and the decoder generates the track of another instrument. We consider the piano or the guitar as the additional instrument, since these are dominant instruments. We use the Remi representation [54] to represent multi-track music. Compared to the commonly used Midi-like event representation [52], the Remi representation includes information such as tempo changes, chord, position, and bar, which allows our model to learn the dependency of note events occurring at the same positions across bars. For both the encoder and the decoder, we adopt the Transformer-XL network [7], which extends the transformer with a recurrence mechanism. The recurrence mechanism enables the model to leverage information from past tokens beyond the current training segment and to look further into the history. The encoder contains a stack of multi-head self-attention layers. Its output $E_i$ can be represented as $E_i = \mathrm{Enc}(x_i, M^E_i)$, where $M^E_i$ is the encoder memory used for the $i$-th bar input, namely the encoder hidden-state sequence computed in previous recurrent steps.
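
To make the recurrence mechanism concrete, below is a minimal single-layer PyTorch sketch (an illustration under simplifying assumptions, not the authors' implementation): each bar attends over a gradient-detached memory of the previous bars concatenated with the current bar, and the memory is updated after every bar. Transformer-XL's relative positional encoding and per-layer memories are omitted, and all names are ours.

```python
import torch
import torch.nn as nn

class RecurrentEncoderLayer(nn.Module):
    """Segment-level recurrence: the current bar attends over cached (detached)
    states of previous bars plus the current bar, in the spirit of E_i = Enc(x_i, M_i)."""
    def __init__(self, d_model=256, n_heads=4, mem_len=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mem_len = mem_len

    def forward(self, x, memory=None):
        # x: (B, T_bar, d) embeddings of the current bar; memory: (B, T_mem, d) or None.
        context = x if memory is None else torch.cat([memory, x], dim=1)
        h, _ = self.attn(query=x, key=context, value=context)
        h = self.norm1(x + h)
        h = self.norm2(h + self.ff(h))
        # Cache the (detached) context as memory for the next bar, truncated to mem_len,
        # so gradients do not flow across segments.
        new_memory = context.detach()[:, -self.mem_len:]
        return h, new_memory

# Processing a (dummy, already embedded) drum track bar by bar.
layer = RecurrentEncoderLayer()
memory = None
bars = [torch.randn(1, 32, 256) for _ in range(4)]
for x_i in bars:
    E_i, memory = layer(x_i, memory)   # E_i: (1, 32, 256), memory grows up to mem_len
```
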
Similarly, in the decoder, the prediction of the $j$-th token of the $i$-th bar, $y_{i,j}$, is formulated as $y_{i,j} = \mathrm{Dec}(y_{i,t}$