When Counterpoint Meets Chinese Folk Melodies

Nan Jiang, Sheng Jin, Zhiyao Duan, Changshui Zhang

Institute for Artificial Intelligence, Tsinghua University (THUAI), State Key Lab of Intelligent Technologies and Systems, Beijing National Research Center for Information Science and Technology (BNRist), Department of Automation, Tsinghua University, Beijing, China
Department of Electrical and Computer Engineering, University of Rochester
jiangn15@mails.tsinghua.edu.cn, js17@mails.tsinghua.edu.cn, zhiyao.duan@rochester.edu, zcs@mail.tsinghua.edu.cn

Abstract

Counterpoint is an important concept in Western music theory. In the past century, there has been significant interest in incorporating counterpoint into Chinese folk music composition. In this paper, we propose a reinforcement learning-based system, named FolkDuet, for online countermelody generation for Chinese folk melodies. With no existing data of Chinese folk duets, FolkDuet employs two reward models trained on out-of-domain data, i.e., Bach chorales and monophonic Chinese folk melodies. An interaction reward model is trained on duets formed from the outer parts of Bach chorales to model counterpoint interaction, while a style reward model is trained on monophonic melodies of Chinese folk songs to model their melodic patterns. With both rewards, the generator of FolkDuet is trained to generate countermelodies while maintaining the Chinese folk style. The entire generation process is performed in an online fashion, allowing real-time interactive human-machine duet improvisation. Experiments show that the proposed algorithm achieves better subjective and objective results than the baselines.

1 Introduction

Counterpoint [31], as an important and unique concept in Western music theory, is commonly identified in works composed in the Baroque and Classical periods.
It refers to the mediation of two or more musical voices into a meaningful and pleasing whole, where the roles of these voices are somewhat equal. Chinese folk melodies, on the other hand, are typically presented in a monophonic form or with accompaniments that are less melodic [28], with few exceptions [45]; they feature unique melodic patterns and are mostly based on the pentatonic scale. In the past century, some renowned Chinese composers, e.g., Xian Xinghai and He Luting, have explored incorporating counterpoint and fugue into Chinese music [38], and these attempts have produced several successful choral and orchestral works such as Yellow River Cantata and Buffalo Boy's Flute. However, systematic theories of integrating counterpoint with Chinese folk melodies, as well as broader influence of such integration on the general public, are lacking. This motivates our work. In this paper, we propose a system named FolkDuet to automatically generate countermelodies for Chinese folk melodies, following the counterpoint concept in Western music theory while maintaining the Chinese folk style. Instead of offline harmonization, we perform the generation in an online (causal) fashion, where the countermelody is generated in real time as the input melody streams in, for a broader research scope and better engagement of users [2]. We believe that this is an innovative idea that would make a broader impact in two aspects: 1) it would facilitate music cultural exchanges between the West and the East at a much larger scale through automatic music generation and style fusion, and 2) it would further the idea of collaborative counterpoint improvisation between a human and a machine [2] in music traditions where such counterpoint interaction has been less common.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
Automatically harmonizing a melody or a bass line into multi-part music has been investigated for decades, but primarily on Western classical music, especially J. S. Bach's chorales. Early methods were rule-based [17, 43, 35], where counterpoint rules were encoded to guide the generation process. Later, statistical machine learning methods were proposed to learn counterpoint interaction directly from training data. For Chinese folk melodies, however, both systematic counterpoint rules and large repertoires of countermelodies are lacking, making these existing approaches infeasible. The key idea of our proposed method is to employ reinforcement learning (RL) to learn and adapt counterpoint patterns from Western classical music while maintaining the melodic style of Chinese folk melodies during countermelody generation. Specifically, we design two data-driven reward models for this task. The first is trained on the outer parts (the soprano and the bass line) of Bach chorales to model generic counterpoint interaction; it models the mutual information between the outer parts of Bach chorales. The other is trained by maximum entropy inverse reinforcement learning (IRL) on monophonic Chinese folk songs to model the melodic style. The reinforcement learning scheme then fuses the generic counterpoint interaction patterns with the melodic style during countermelody generation. The proposed method works in an online fashion, aiming to support human-machine collaborative music improvisation. Compared with offline methods, which can iteratively revise the generated music content (e.g., using Gibbs sampling) [7, 19, 24, 47], it is more difficult for online harmonization to produce globally coherent and pleasing content. In this regard, the proposed method follows RL-Duet [27], which has shown that reinforcement learning helps to improve global coherence in online harmonization in the style of Bach chorales.
The proposed work differs from RL-Duet in two aspects: 1) it generates the countermelody in the Chinese folk style instead of the Bach chorale style on which the counterpoint rewarder is trained, and 2) it uses an iteratively updated rewarder (i.e., IRL) instead of a fixed rewarder to achieve the knowledge transfer and style fusion. To validate the proposed FolkDuet system, we compare it with two baselines: RL-Duet and a maximum likelihood estimation (MLE) baseline trained on the outer parts of Bach chorales. Objective experiments show that FolkDuet generates countermelodies whose statistics are closer to those of the Chinese folk melodies in the dataset and which have better key consistency with the melody. Subjective listening tests also show that the proposed approach achieves higher ratings than the baseline method, in terms of both the harmonicity between the two parts and the maintenance of the Chinese folk style. Our main contributions are summarized as follows: 1) To the best of our knowledge, this is the first attempt to fuse Western counterpoint interaction with the Chinese folk melody style in countermelody generation. 2) The proposed RL approach learns to generate countermelodies in real time for Chinese folk melodies purely from out-of-domain data (i.e., Bach chorales and monophonic Chinese folk melodies) instead of from Chinese folk duets, which are scarce. 3) The proposed counterpoint rewarder learns and transfers counterpoint interaction from the mutual information between the two outer parts of Bach chorales to the Chinese folk countermelody task.

2 Related Work

2.1 Automatic Counterpoint Composition

There has been significant progress in automatic counterpoint composition. For example, David Cope developed the famous EMI (Experiments in Musical Intelligence) program [9], which employs elaborately designed musical rules to imitate a composer's style and create new melodies. Farbood et al. [12] implemented the first HMM-based algorithm to write first-species counterpoint.
Herremans et al. [20, 21] used tabu search to optimize an objective function involving constraints for counterpoint generation. J. S. Bach's chorales have been the main corpus in computer music for tackling full-fledged counterpoint generation, and several approaches have been proposed: earlier work uses rule-based [3], constraint-based [39], or statistical models [8], while recent work mainly uses neural networks such as BachBot [34], DeepBach [19], Coconet [24, 25], and RL-Duet [27].

2.2 Reinforcement Learning for Music Generation

Recent works have explored reinforcement learning (RL) for automatic music generation [15, 6, 30, 18, 26]. RL-based approaches consider long-term consistency and are able to explicitly apply non-differentiable musical rules or preferences. In Sequence Tutor [26], hand-crafted rules are explicitly modeled as the reward function to train the RNN. However, it is difficult, if not impossible, to achieve a balance among these different rules, especially when they conflict with each other. More recently, RL-Duet [27] was proposed to ensemble multiple data-driven reward models into a comprehensive reward agent; nevertheless, it still uses one rule to punish repetitions of generated notes. In our work, we do not rely on hand-crafted rules or ensemble methods to obtain a comprehensive reward function. Instead, we use inverse reinforcement learning (IRL), which learns to infer the reward function underlying observed expert behavior [37]. IRL and RL are usually trained alternately, so that the reward function can be dynamically adjusted to suit the reinforcement learning process. Such alternate training is similar to that of generative adversarial networks (GANs). As some works draw an analogy between the theories of GANs and IRL [13, 22] from the optimization perspective, IRL could address the reward sparsity and mode collapse problems of GANs [42].
2.3 Unsupervised Style Transfer

Inspired by the recent progress in visual style transfer [16], music style transfer is attracting more attention. Music style transfer is challenging because music is semantically abstract and multi-dimensional in nature [4]. According to [11], music style transfer can be categorized into 1) timbre style transfer for sound textures [44, 14], 2) performance style transfer for music performance control [36], and 3) composition style transfer for music generation [40, 32, 50, 29]. Regarding approaches, some methods achieve style transfer by disentangling the pitch content and rhythm styles in latent music representations [40, 49], while others apply explicit rules to modify monophonic [50] or polyphonic melodies [29]. Our work can be viewed as transferring the Chinese folk style from the input melody to the generated countermelody.

3 Methodology

3.1 Music Encoding

In our online countermelody generation task, each music piece has two parts: 1) the Chinese folk melody $H = [h_0, h_1, \ldots, h_{K_H}]$, which is composed by a human, and 2) the machine-generated countermelody $M = [m_0, m_1, \ldots, m_{K_M}]$, which harmonizes $H$. Here we use a note-based representation, where $m_k$ and $h_k$ are the $k$-th notes in the machine and human parts, respectively. As shown in Fig. 2(a), a note $m_k$ is represented with two items, $m_k^{(p)}$ and $m_k^{(d)}$, for pitch and duration, respectively. This is different from most existing work, which uses a grid-based representation with time quantized into small beat subdivisions (e.g., 16th notes [19]), where a note is represented by a pitch-onset token followed by a sequence of hold tokens. In this work, we prefer a note-based representation for two reasons: 1) Note-based representations are closer to how humans compose and perceive music.
2) From the optimization perspective, note-based representations are more compact and do not have the token imbalance issue that grid-based representations have due to the much larger number of hold tokens than pitch-onset tokens. One challenge we face in our note-based representation is that it does not provide a natural synchronization between the two parts as grid-based representations do. To address this, we also encode the within-measure position $b_k$ of each note $m_k$. This $b_k$ is computed by taking the note onset time $t_k$ modulo 16, assuming a 4/4 time signature and a position resolution of 16th notes. This positional metadata gives the generator a notion of the beat. Although many melodies do not follow the 4/4 time signature, this simple approach achieves our goal of synchronization between the two parts.

3.2 Framework

For our task of online countermelody generation for Chinese folk melodies, we intend to achieve a globally coherent and pleasing interaction between the two parts borrowing the Western counterpoint concept, while maintaining the Chinese folk style in the generated countermelody. We propose to achieve this using reinforcement learning. Fig. 1 illustrates the framework of our proposed system, named FolkDuet. It contains a generator and two rewarders. The inter-rewarder models the counterpoint interaction in Western music, while the style-rewarder models the melodic patterns of Chinese folk songs.

Figure 1: FolkDuet system framework. At each step $k$, the generator generates a new note $m_k$ in the machine part based on its observation of the state $S_k$, which contains all past notes input from the human and generated by the machine. Then the two rewarders evaluate $m_k$ based on their observations of $S_k$. The style-rewarder is trained alternately with the generator, using inverse reinforcement learning, while the inter-rewarder is pre-trained on Bach chorales and then fixed during the training of the generator.
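The within-measure position encoding described in Sec. 3.1 can be sketched as follows. This is a minimal illustration (the tuple layout and function name are ours, not from the paper), assuming onsets and durations are measured in 16th notes under a 4/4 time signature:

```python
POSITIONS_PER_MEASURE = 16  # 4/4 time at 16th-note resolution

def encode_notes(pitches, durations):
    """Turn parallel pitch/duration lists (durations in 16th notes)
    into (pitch, duration, beat_position) triples, where
    beat_position is b_k = t_k mod 16."""
    notes, onset = [], 0
    for pitch, dur in zip(pitches, durations):
        beat_pos = onset % POSITIONS_PER_MEASURE  # within-measure position
        notes.append((pitch, dur, beat_pos))
        onset += dur  # monophonic part: next note starts when this one ends
    return notes
```

For example, a note whose onset time is 18 receives position 18 mod 16 = 2, i.e., the third 16th-note slot of its measure, regardless of how many measures have elapsed.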
The generator and the style-rewarder are trained alternately. The generator is trained through reinforcement learning with rewards provided by the two rewarders, while the style-rewarder is updated using the monophonic folk melodies in the training set (i.e., the human part) and their generated countermelodies (i.e., the machine part) via maximum entropy inverse reinforcement learning. Its learning objective is to infer the reward function that underlies the demonstrated expert behavior (i.e., the human part). The inter-rewarder measures the degree of interaction between the human and machine parts through a mutual-information-informed measure. This measure is computed by the Bach-HM and Bach-M models, both of which are maximum likelihood models pre-trained on duets extracted from the outer parts of Bach chorales. Different from the style-rewarder, the inter-rewarder is fixed during the reinforcement learning of the generator.

Figure 2: The note-based representation and the architectures of the Generator and Style-Rewarder. The architectures of the two models in the Inter-Rewarder are contained in the supplementary material. p-emb, d-emb and b-emb represent pitch/duration/beat embedding modules, respectively. GRU represents the Gated Recurrent Unit [5], and fc stands for a fully-connected layer.

3.3 Generator

The generator generates the next note $m_k$ using its policy $\pi(m_k \mid O_k^{(g)})$, where $O_k^{(g)}$ is its observation of the state $S_k$. Its training objective is to maximize the expected long-term discounted reward $\mathbb{E}_\pi[R]$. The long-term discounted reward of note $m_k$ is

$$R_k = \sum_{j=k}^{T+k} \gamma^{j-k} \left( r_j^{(s)} + \alpha r_j^{(i)} \right),$$

where $r_j^{(s)}$ and $r_j^{(i)}$ are the two immediate rewards for the future note $m_j$ from the style-rewarder and the inter-rewarder, respectively, and $\alpha$ is a weight factor. In each training iteration, we first randomly pick one monophonic folk melody from the training data as the human part $H = [h_0, h_1, \ldots, h_{K_H}]$.
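The discounted return defined above can be computed directly from the two per-note reward streams. A minimal sketch (the names are ours; for simplicity, the sum runs to the end of the available rewards rather than to a fixed horizon $T$):

```python
def discounted_return(r_style, r_inter, k, gamma=0.99, alpha=1.0):
    """R_k = sum_j gamma^(j-k) * (r_style[j] + alpha * r_inter[j]),
    combining the style and interaction rewards with weight alpha."""
    total = 0.0
    for j in range(k, len(r_style)):
        total += gamma ** (j - k) * (r_style[j] + alpha * r_inter[j])
    return total
```

Smaller values of gamma make the generator myopic, while values near 1 emphasize the long-term coherence that motivates the RL formulation.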
Before generating the $k$-th note of the machine part, the generator's observation $O_k^{(g)}$ contains the already generated machine part $\{m_i\}_{i=0}^{k-1}$, the already completed notes in the human part $\{h_i\}_{i=0}^{k-1}$, and just the pitch of the currently played note in the human part, $h_k$. Note that the duration of $h_k$ is not part of the observation, because in real-time human-machine interactive generation, the generator does not know when the note will end. The generator's observation strictly obeys the causality constraint: for notes in the human part that have ended strictly before the onset of the to-be-generated machine note, it observes both the pitch and the duration; for the currently played human note, it observes only the pitch but not the duration; and a human note starting at the same time as the machine note is not observed at all. The architecture of the generator is illustrated in Fig. 2(c) and detailed in the supplementary material. It outputs the probability of action $m_k$ conditioned on the current observation $O_k^{(g)}$, i.e., $\pi(m_k \mid O_k^{(g)})$, and the state value $v_k$. We use the actor-critic algorithm with the generalized advantage estimator (GAE) [41] to train the generator.

3.4 Style Rewarder

The style-rewarder aims to capture the Chinese folk melodic style. We employ maximum entropy inverse reinforcement learning (IRL) [46] to infer a reward function underlying the demonstrated expert behavior, i.e., the Chinese folk melodies. As there are no existing data of Chinese folk duets, the Style-Rewarder is pre-trained on the monophonic Chinese folk melodies. We then use IRL and RL to alternately update the Style-Rewarder and the generator. Following [46], the probability of a note sequence $\tau = \{n_k\}_{k=0}^{K}$ is proportional to the exponential of the reward along the trajectory $\tau$:
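The causality constraint on the generator's observation can be illustrated with a small sketch. The note format (pitch, duration, onset) and the function name are hypothetical: at the onset of the machine note being generated, ended human notes expose pitch and duration, the currently sounding human note exposes only its pitch, and simultaneous or future human notes are hidden.

```python
def causal_observation(human_notes, machine_notes, machine_onset):
    """human_notes: list of (pitch, duration, onset) tuples.
    Returns the information visible to the generator at machine_onset,
    appended to the already generated machine part."""
    obs = list(machine_notes)  # all previously generated machine notes
    for pitch, dur, onset in human_notes:
        if onset + dur <= machine_onset:
            obs.append(("human", pitch, dur))    # ended: pitch + duration
        elif onset < machine_onset:
            obs.append(("human", pitch, None))   # sounding: pitch only
        # onset >= machine_onset: starts now or later -> not observed
    return obs
```

The `None` duration for the currently sounding note mirrors the fact that, in real time, the generator cannot know when that note will end.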
$$p(\tau) = \frac{1}{Z} \exp\left(R(\tau)\right), \quad (1)$$

where $Z = \int_\tau \exp\left(R(\tau)\right) d\tau$ is the partition function, $R(\tau) = \sum_{(S_k, n_k) \in \tau} r^{(s)}(O_k^{(s)}, n_k)$ is the accumulated reward of path $\tau$ parameterized by the Style-Rewarder $R$, and $O_k^{(s)}$ is the Style-Rewarder's observation of state $S_k$. As in [42], to maximize the likelihood of the expert sequences (monophonic folk melodies), we have the following optimization gradient for the Style-Rewarder:

$$\begin{aligned}
\nabla J_r &= \mathbb{E}_{\tau \sim p_{\text{expert}}} \nabla \log p(\tau) \\
&= \mathbb{E}_{\tau \sim p_{\text{expert}}} \nabla R(\tau) - \frac{1}{Z} \int_\tau \exp\left(R(\tau)\right) \nabla R(\tau)\, d\tau \\
&= \mathbb{E}_{\tau \sim p_{\text{expert}}} \nabla R(\tau) - \mathbb{E}_{\tau \sim p(\tau)} \nabla R(\tau) \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \nabla R(\tau_i) - \frac{1}{\sum_j w_j} \sum_{j} w_j \nabla R(\tau'_j).
\end{aligned}$$

The last step replaces sampling $\tau \sim p(\tau)$ with sampling $\tau'$ from the generator, using importance sampling weights $w_j \propto \frac{\exp(R(\tau'_j))}{q_\theta(\tau'_j)}$, where $q_\theta$ is the generation probability. The architecture of the Style-Rewarder is illustrated in Fig. 2(b) and detailed in the supplementary material. Its observation $O_k^{(s)}$ of the state $S_k$ is only the already generated machine part, and it outputs the style reward $r_k^{(s)}$.

3.5 Inter-Rewarder

As we do not have access to Chinese folk counterpoint works, we use the paired outer parts of Bach chorales to learn a reward function for counterpoint interaction. We propose to measure this interaction using mutual information, which has been introduced as an objective function measuring mutual dependence [1, 33]. For two random variables $X$ and $Y$, their mutual information is defined as the difference of the entropy of $Y$ before and after the value of $X$ is known, i.e., $I(X, Y) = H(Y) - H(Y \mid X)$. It describes the amount of information shared between the two variables. A coherent counterpoint piece is expected to share an appropriate amount of information among its parts: the amount is neither too high (e.g., identical parts) nor too low (e.g., irrelevant parts). The key to this rewarder is to know what amount is appropriate.
Let random variables $(X, Y)$ be a duet that follows the joint distribution of duets extracted from the outer parts of Bach chorales, and let $X_i = \{x_k^{(i)}\}_{k=0}^{K_{x_i}}$ and $Y_i = \{y_k^{(i)}\}_{k=0}^{K_{y_i}}$ be the $i$-th sample from this Bach chorale duet dataset. Then we have

$$\begin{aligned}
I(X, Y) &= \sum_{X, Y} P(X, Y) \log \frac{P(X, Y)}{P(X) P(Y)} \\
&= \sum_{X, Y} P(X, Y) \left[ \log P(Y \mid X) - \log P(Y) \right] \\
&\approx \sum_{X_i, Y_i \sim P_{X,Y}} \left[ \log P(Y_i \mid X_i) - \log P(Y_i) \right],
\end{aligned}$$

where the conditional probability factorizes over notes as $P(Y_i \mid X_i) = \prod_{k} P(y_k^{(i)} \mid X_i, y_{<k}^{(i)})$.
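One plausible reading of this sample-wise estimate is a per-note interaction score. The sketch below assumes (our naming, not the paper's) that the two pretrained models expose per-note log-probabilities: Bach-HM scores $\log P(y_k \mid X, y_{<k})$ with the human part visible, and Bach-M scores $\log P(y_k \mid y_{<k})$ without it.

```python
def inter_rewards(logp_bach_hm, logp_bach_m):
    """Per-note interaction score as a pointwise mutual-information gap:
    how much more predictable each machine note becomes once the human
    part is observed.

    logp_bach_hm -- per-note log P(y_k | X, y_<k) from Bach-HM
    logp_bach_m  -- per-note log P(y_k | y_<k) from Bach-M
    """
    return [hm - m for hm, m in zip(logp_bach_hm, logp_bach_m)]
```

Summed over notes, these differences estimate $\log P(Y \mid X) - \log P(Y)$, the sample-wise contribution to the mutual information above; as the text notes, the rewarder must still judge what amount of shared information is appropriate rather than simply maximizing it.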