# Adaptive Accompaniment with ReaLchords

Yusong Wu¹ ², Tim Cooijmans², Kyle Kastner³, Adam Roberts¹, Ian Simon¹, Alexander Scarlatos¹ ⁴, Chris Donahue¹ ⁵, Cassie Tarakajian¹, Shayegan Omidshafiei⁶ ⁷, Aaron Courville² ⁸, Pablo Samuel Castro¹ ², Natasha Jaques¹ ⁹, Cheng-Zhi Anna Huang¹ ² ⁸

¹Google DeepMind ²Mila - Quebec AI Institute, Université de Montréal ³Google ⁴University of Massachusetts Amherst ⁵Carnegie Mellon University ⁶Work done while at Google ⁷Field AI ⁸Canada CIFAR AI Chair ⁹University of Washington. Correspondence to: Yusong Wu, Cheng-Zhi Anna Huang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

Jamming requires coordination, anticipation, and collaborative creativity between musicians. Current generative models of music produce expressive output but are not able to generate in an online manner, meaning simultaneously with other musicians (human or otherwise). We propose ReaLchords, an online generative model for improvising chord accompaniment to user melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use. The finetuning objective leverages both a novel reward model that provides feedback on harmonic and temporal coherency between melody and chord, and a divergence term that implements a novel type of distillation from a teacher model that can see the future melody. Through quantitative experiments and listening tests, we demonstrate that the resulting model adapts well to unfamiliar input and produces fitting accompaniment. ReaLchords opens the door to live jamming, as well as simultaneous co-creation in other modalities.

## 1. Introduction

Deep generative models produce realistic, high-quality content, and are seeing increasing integration into the creative processes of artists. However, such models tend not to be designed for the demands of live scenarios such as interactive improvisation, which requires anticipation of others' intentions and adaptation to mistakes, stylistic choices, and deliberate exploration. We focus on music in particular, which is inherently interactive and dynamic, and revolves around anticipatory collaboration in ensemble settings. Most existing models in this space, while capable of creating expressive compositions or accompaniments, are not suited for simultaneous creation, where adaptation to and alignment with the ongoing musical structure are crucial.

This paper introduces ReaLchords, a generative model tailored for online adaptive musical accompaniment. Emulating the spontaneity of live music jamming, ReaLchords generates chord accompaniments in response to a stream of monophonic melody notes, adapting on-the-fly to the unfolding musical narrative. Each chord must be generated without knowing in advance which melody note it will accompany. This simultaneous interaction imposes a conditional independence assumption on the joint generative process that an online model must respect. Moreover, the model must be able to gracefully handle unfamiliar situations and unexpected changes. Likelihood models, however, suffer from exposure bias due to being trained entirely on ground-truth data, and transfer poorly to online settings where mistakes, imperfections, and stylistic differences are common (see Figure 1 for an example).
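To make the online constraint concrete, the following is a minimal sketch of an interleaved decoding loop in which each chord is committed before the concurrent melody note is heard. This is an illustrative sketch only, not the ReaLchords implementation; the `ChordModel` interface, the `jam` helper, and the token types are hypothetical.

```python
# Minimal sketch of online, interleaved accompaniment decoding.
# At step t the model must commit to chord y_t using only the past
# (x_1, y_1), ..., (x_{t-1}, y_{t-1}); the concurrent melody token x_t
# is revealed only after y_t has been played.

from typing import Iterable, List, Tuple


class ChordModel:
    """Placeholder for an autoregressive chord model (hypothetical API)."""

    def sample_chord(self, history: List[Tuple[int, int]]) -> int:
        """Sample the next chord token given past (melody, chord) pairs."""
        raise NotImplementedError


def jam(model: ChordModel, melody_stream: Iterable[int]) -> List[Tuple[int, int]]:
    """Interleave user melody tokens with model chord tokens in real time."""
    history: List[Tuple[int, int]] = []
    for melody_token in melody_stream:
        # The chord is committed *before* the melody token for this step
        # is observed -- this is the conditional independence constraint.
        chord_token = model.sample_chord(history)
        # Only now does the model "hear" what the user actually played.
        history.append((melody_token, chord_token))
    return history
```

The conditional independence arises because `sample_chord` conditions only on completed past steps; the melody token for the current step enters the history only after the chord has been emitted.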
To address this exposure bias, we use RL finetuning to improve the model with respect to reward models that consider musical coherence (§3.2). These reward models see the entire composition and evaluate its musical coherence from various perspectives (§3.3). Our setup bears similarities to RLHF (Ouyang et al., 2022; Jaques et al., 2019) and RLAIF (Bai et al., 2022; Lee et al., 2023); however, our reward models are trained through self-supervision rather than human labeling. Finally, we combine RL finetuning with knowledge distillation (Agarwal et al., 2023; Zhou et al., 2023) in a novel way, distilling from a teacher that can see the future into a student that cannot, hence forcing anticipation (§3.4).

Figure 1. Online models finetuned with RL are able to recover from mistakes, while models trained with MLE alone do not. We take a melody from the test set and midway introduce an abrupt transposition designed to disrupt the accompaniment model (top row). The Online MLE model predicts a bad chord (B7) and fails to adapt. ReaLchords also predicts a bad chord (F♯m), but adapts quickly. Wrong chords highlighted in orange reflect our own judgment informed by music theory, but the overall pattern is corroborated by an objective measure of harmonic quality, averaged over many trials of this experiment (bottom row).

We develop the key algorithmic components (Figure 2) needed to produce an online adaptive accompaniment model that is amenable to interactive use. Our contributions and findings are as follows:¹

- We propose ReaLchords, an online accompaniment generation model trained by RL finetuning. Figure 1 shows how ReaLchords adapts to out-of-distribution input, a necessary skill for live jamming.
- We leverage knowledge distillation to learn from a non-causal teacher that can see the future (§3.4). Distillation greatly improves the quality of the model, as evidenced by the human evaluation shown in Figure 3.
- We further employ a novel set of self-supervised reward models to encourage musical coherence and perceptual quality (§3.3). Based on a human listening test, we show that our reward models align closely with human preferences (Figure 3), despite being trained without human feedback (§3.3).
- We demonstrate through a series of controlled experiments that, without RL finetuning, models fail to adapt to mistakes and perturbations (Figure 4, §5.4).
- Finally, we analyze the behavior of our models in terms of domain-specific metrics (Table 1, §5.3). We find that each component of our RL finetuning method improves the rhythmic and harmonic quality of generated accompaniments.

¹ Listen to audio examples here: https://storage.googleapis.com/realchords/index.html
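The sketch below illustrates, at a high level, how a learned reward and the distillation term might combine into a single finetuning loss. It is a simplified sketch under our own assumptions (a REINFORCE-style policy gradient, a single scalar reward, and a fixed divergence weight), not the exact ReaLchords objective; the function and argument names are hypothetical.

```python
# Sketch of an RL finetuning objective with a distillation term.
# A reward model scores the harmonic/temporal fit of a completed
# (melody, chord) sequence; the teacher is an offline model that sees
# future melody, the student is the online policy that does not.
# All components are hypothetical stand-ins, not the paper's code.

import torch
import torch.nn.functional as F


def finetuning_loss(student_logits: torch.Tensor,   # [T, vocab], online policy
                    teacher_logits: torch.Tensor,   # [T, vocab], offline teacher
                    chord_tokens: torch.Tensor,     # [T], sampled chord tokens
                    reward: torch.Tensor,           # scalar from a reward model
                    beta: float = 0.1) -> torch.Tensor:
    """REINFORCE-style loss: maximize reward while staying near the teacher."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    chosen = log_probs.gather(-1, chord_tokens.unsqueeze(-1)).squeeze(-1)  # [T]

    # Policy-gradient term: raise the log-probability of the sampled chords,
    # weighted by the (detached) sequence-level reward.
    pg_term = -(reward.detach() * chosen.sum())

    # Distillation term: divergence between the frozen, future-aware teacher's
    # per-step distribution and the online student's, encouraging anticipation.
    kl_term = F.kl_div(log_probs,
                       F.log_softmax(teacher_logits.detach(), dim=-1),
                       log_target=True,
                       reduction="batchmean")

    return pg_term + beta * kl_term
```

The actual method combines several self-supervised reward models and applies the divergence during RL finetuning; the sketch only conveys the overall shape of the objective.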
## 2. Related Work

**Adaptive music accompaniment systems** In contrast to automatic music generation systems, accompaniment systems take input (such as a melody) from a user and generate output that is meant to be played in synchrony, complementing what the user is playing. Some of these systems are asynchronous: the user first provides the full melody, and the system generates an accompaniment offline. Examples include MySong (Simon et al., 2008), where a user sings a melody and the system generates chords to accompany them. More recently, SingSong (Donahue et al., 2023) supports a very similar interaction, but generates full-band backing tracks. Both are offline systems. In contrast, online accompaniment systems need to synchronize with user actions in real time.

Score-following is a special case where the system has the score, the full context of what the musician will play, but still needs to follow along and infer when to play its own part. Music Plus One (Raphael, 2010) adapts the playback speed of an orchestral recording (without the soloist) to a soloist's expressive performance. Similarly, Antescofo (Cont, 2008) follows where a soloist is in a score and triggers live electronics accordingly.

Generative accompaniment systems, or more generally co-creative music systems, not only have to anticipate user actions but also need to learn how to respond. Voyager (Lewis, 2003) takes a rule-based approach to listening, responding, and generating musical material on the fly, while the OMax Brothers system (Assayag et al., 2006) recombines what a musician plays on-the-fly as an accompaniment, but often requires another computer musician to control when it comes in and which section of material to draw from. ImproteK and later Djazz (Nika & Chemillier, 2012; Nika et al., 2017) leverage a shared predefined chord progression (such as a jazz standard) to coordinate the human-machine improvisation. Instead of tight synchronization, Spire Muse (Thelle & Pasquier, 2021) serves as a brainstorming partner that retrieves musical responses that are more or less similar depending on whether the user is in a converging or diverging phase of ideation.

Recent systems based on deep neural networks have emerged. BachDuet (Benetatos et al., 2020) trains an LSTM model using MLE for counterpoint (melody-to-bassline) accompaniment. SongDriver (Wang et al., 2022) focuses on online melody-to-chord accompaniment, similar to our work. To address exposure bias, SongDriver employs two MLE-trained models: a transformer model that predicts the current output based on both current and past outputs, and a conditional random field (CRF) model that predicts the current output based on previous context. The CRF model makes online predictions but does not use its own predictions as future context; instead, it relies on the transformer model for context. In contrast, our system ReaLchords learns how to respond in tight synchronization with the user's melody, by first learning the interdependencies between melody and accompaniment from existing songs, and then using RL to tune the model to respond in an adaptive fashion.

**RL finetuning for generative models** Reinforcement learning (RL) finetuning has proven effective in aligning language models with human preferences (Ouyang et al., 2022; Jaques et al., 2019) and constraints (Jaques et al., 2017), which are often unaddressed in generative pretraining. In some cases, RL finetuning has been applied to enhance music generation models (Jaques et al., 2017; Jiang et al., 2020b). Most closely related to our work is RL-Duet (Jiang et al., 2020b), which considers a similar online generation setting, namely a duet between a user and an agent, both of them playing each note without knowing what the other will play. Our work provides several contributions over RL-Duet. First, RL-Duet is trained on Bach Chorales, a small dataset of approximately 400 songs following strict rules of counterpoint composition in the style of a particular composer. In contrast, our models are trained on the diverse Hooktheory dataset of 38,000 popular songs from a wide array of artists.
To enable effective learning at this scale, we develop novel multiscale contrastive and discriminative reward models, and also propose a new knowledge distillation technique specifically geared toward the online generation setting. Finally, RL-Duet experiments are limited to the setting in which the RL model is primed with the first few ground-truth notes of the accompaniment, an unrealistic assumption for real-time collaborative jamming. As we will show in §5.4, our method is able to begin jamming with the user's melody within a few beats, and to adapt to sudden perturbations in the key.

Our work is related to the emerging literature on Reinforcement Learning from AI Feedback (RLAIF) (Saleh et al., 2020; Bai et al., 2022; Lee et al., 2023), which mitigates the need for extensive human labeling by using an AI assistant to generate feedback. We use this strategy to finetune online music language models, using an MLE model to obtain a learning signal. Recently, Agarwal et al. (2023) have shown that adding a distillation objective between the policy and a larger teacher model during RL finetuning further improves performance. ReaLchords employs a novel knowledge distillation objective between the online policy and an offline model that can see future context, bridging the gap between online improvisational capability and offline musical coherence.

## 3. Online Musical Accompaniment

We seek a generative model that can be used for interactive music accompaniment, where a user plays a melody and the model simultaneously plays chords to support it. Accompaniment is a special case of the general setting in which two agents generate a joint sequence $(x_1, y_1), \ldots, (x_T, y_T)$ in chronological order. At each step $t$, agents observe the historical material x