# Efficient Neural Music Generation

Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

Speech, Audio & Music Intelligence (SAMI), ByteDance

Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modeling. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audios of state-of-the-art quality while reducing 95.7% to 99.6% of the forward passes in MusicLM, respectively, for sampling 10s to 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.

1 Introduction

Music is an art composed of harmony, melody, and rhythm that permeates every aspect of human life. With the blossoming of deep generative models [1–3], music generation has drawn much attention in recent years [4–6]. As a prominent class of generative models, language models (LMs) [7, 8] showed extraordinary capability in modeling complex relationships across long-term contexts [9–11]. In light of this, AudioLM [3] and many follow-up works [5, 12–14] successfully applied LMs to audio synthesis. Concurrent to the LM-based approaches, diffusion probabilistic models (DPMs) [1, 15, 16], as another competitive class of generative models [2, 17], have also demonstrated exceptional abilities in synthesizing speech [18–20], sounds [21, 22] and music [6, 23].

However, generating music from free-form text remains challenging, as the permissible music descriptions can be very diverse and relate to any of the genres, instruments, tempo, scenarios, or even subjective feelings. Conventional text-to-music generation models are listed in Table 1, where both MusicLM [5] and Noise2Music [6] were trained on large-scale music datasets and demonstrated state-of-the-art (SOTA) generative performance with high fidelity and adherence to various aspects of text prompts. Yet, the success of these two methods comes with large computational costs, which would be a serious impediment to their practicality. In comparison, Moûsai [23], building upon DPMs, made efficient sampling of high-quality music possible. Nevertheless, the number of their demonstrated cases was comparatively small and showed limited in-sample dynamics.
Aiming for a feasible music creation tool, high efficiency of the generative model is essential, since it facilitates interactive creation with human feedback taken into account, as in [24].

Table 1: A comparison of MeLoDy with conventional text-to-music generation models in the literature. We use AC to denote whether audio continuation is supported, FR to denote whether the sampling is faster than real-time on a V100 GPU, VT to denote whether the model has been tested and demonstrated using various types of text prompts including instruments, genres, and long-form rich descriptions, and MP to denote whether the evaluation was done by music producers.

| Model | Prompts | Training Data | AC | FR | VT | MP |
|---|---|---|---|---|---|---|
| Moûsai [23] | Text | 2.5k hours of music | ✓ | ✓ | ✗ | ✗ |
| MusicLM [5] | Text, Melody | 280k hours of music | ✓ | ✗ | ✓ | ✗ |
| Noise2Music [6] | Text | 340k hours of music | ✗ | ✗ | ✓ | ✗ |
| MeLoDy (Ours) | Text, Audio | 257k hours of music¹ | ✓ | ✓ | ✓ | ✓ |

While LMs and DPMs both showed promising results, we believe the relevant question is not whether one should be preferred over the other, but whether we can leverage both approaches with respect to their individual advantages, e.g., [25]. After analyzing the success of MusicLM, we leverage the highest-level LM in MusicLM, termed the semantic LM, to model the semantic structure of music, determining the overall arrangement of melody, rhythm, dynamics, timbre, and tempo. Conditional on this semantic LM, we exploit the non-autoregressive nature of DPMs to model the acoustics efficiently and effectively with the help of a successful sampling acceleration technique [26]. All in all, in this paper, we introduce several novelties that constitute our main contributions:

1. We present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music of competitive quality while reducing 95.7% and 99.6% of the iterations of MusicLM when sampling 10s and 30s music, respectively, and being faster than real-time on a V100 GPU.
2. We propose the novel dual-path diffusion (DPD) models to efficiently model coarse and fine acoustic information simultaneously with a particular semantic conditioning strategy.
3. We design an effective sampling scheme for DPD, which improves the generation quality over the previous sampling method in [23] proposed for this class of LDMs.
4. We reveal a successful audio VAE-GAN that effectively learns continuous latent representations, and is capable of synthesizing audios of competitive quality together with DPD.

2 Related Work

Audio Generation Apart from the generation models shown in Table 1, there are also music generation models [28, 29] that can generate high-quality music samples at high speed, yet they cannot accept free-form text conditions and can only generate single-genre music, e.g., techno music in [29]. There are also some successful music generators in the industry, e.g., Mubert [30] and Riffusion [31]; yet, as analyzed in [5], they struggled to compete with MusicLM in handling free-form text prompts. In the more general scope of audio synthesis, some promising text-to-audio synthesizers [12, 21, 22] trained with AudioSet [32] also demonstrated the ability to generate music from free-form text, but the musicality of their samples is limited.

Acceleration of Autoregressive Models WaveNet [33] is a seminal work that demonstrates the capability of autoregressive (AR) models in generating high-fidelity audio. It comes with the drawback of extremely high computational cost in sampling.
To improve its practical feasibility, Parallel WaveNet [34] and WaveRNN [35] were separately proposed to accelerate WaveNet. With a similar goal, our proposed MeLoDy can be viewed as an accelerated variant of MusicLM, where we replace the last two AR models with a dual-path diffusion model. Parallel to our work, SoundStorm [36] also greatly accelerates AudioLM with a mask-based non-AR decoding scheme [37]. While it is applicable to MusicLM, the sound quality of this model is still limited by the bitrate of the neural codec. In comparison, the proposed diffusion model in MeLoDy operates with continuous-valued latent vectors, which by nature can be decoded into music audios of higher quality.

¹We focus on non-vocal music data by using an audio classifier [27] to filter out in-house music data with vocals. Noticeably, generating vocals and instrumental music simultaneously in one model is defective even in the SOTA works [5, 6] because of the unnatural-sounding vocals. Since this work aims at generating production-level music, we improve the fidelity by reducing the tendency to generate vocals.

Network Architecture The architecture designed for our proposed DPD was inspired by the dual-path networks used in the context of audio separation, where Luo et al. [38] initiated the idea of segmentation-based dual-path processing and triggered a number of follow-up works achieving state-of-the-art results [39–43]. Noticing that the objective in diffusion models can indeed be viewed as a special case of source separation, this kind of dual-path architecture effectively provides us a basis for simultaneous coarse-and-fine acoustic modeling.

3 Background on Audio Language Modeling

This section provides the preliminaries that serve as the basis for our model. In particular, we briefly describe the audio language modeling framework and the tokenization methods used in MusicLM.

3.1 Audio Language Modeling with MusicLM

MusicLM [5] mainly follows the audio language modeling framework presented in AudioLM [3], where audio synthesis is viewed as a language modeling task over a hierarchy of coarse-to-fine audio tokens. In AudioLM, there are two kinds of tokenization for representing different scopes of audio:

- Semantic Tokenization: k-means over representations from SSL, e.g., w2v-BERT [44];
- Acoustic Tokenization: Neural audio codec, e.g., SoundStream [45].

To better handle the hierarchical structure of the acoustic tokens, AudioLM further separates the modeling of acoustic tokens into coarse and fine stages. In total, AudioLM defines three LM tasks: (1) semantic modeling, (2) coarse acoustic modeling, and (3) fine acoustic modeling. We generally define the sequence of conditioning tokens as $c_{1:T_{\text{cnd}}} := [c_1, \ldots, c_{T_{\text{cnd}}}]$ and the sequence of target tokens as $u_{1:T_{\text{tgt}}} := [u_1, \ldots, u_{T_{\text{tgt}}}]$. In each modeling task, a Transformer-decoder language model parameterized by $\theta$ is tasked to solve the following autoregressive modeling problem:

$$p_\theta(u_{1:T_{\text{tgt}}} \mid c_{1:T_{\text{cnd}}}) = \prod_{j=1}^{T_{\text{tgt}}} p_\theta\big(u_j \,\big|\, [c_1, \ldots, c_{T_{\text{cnd}}}, u_1, \ldots, u_{j-1}]\big), \tag{1}$$

where the conditioning tokens are concatenated to the target tokens as prefixes. In AudioLM, semantic modeling takes no condition; coarse acoustic modeling takes the semantic tokens as conditions; fine acoustic modeling takes the coarse acoustic tokens as conditions. The three corresponding LMs can be trained in parallel with the ground-truth tokens, but need to be sampled sequentially at inference time, as sketched below.
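To make the sequential inference cost concrete, the following sketch mimics the three-stage sampling pipeline described above. It is a minimal illustration, not the authors' implementation: `sample_lm` is a hypothetical stand-in for one autoregressive Transformer decoder, the token ids are placeholders, and the per-second token counts (25 semantic, 200 coarse, 400 fine) follow the NFE breakdown reported for MusicLM in Table 3.

```python
# A minimal sketch (not the authors' implementation) of the sequential three-stage
# sampling in an AudioLM/MusicLM-style hierarchy. Each call to `sample_lm` stands in
# for one autoregressive Transformer decoder: one forward pass per generated token.

def sample_lm(prefix_tokens, num_tokens):
    """Autoregressively sample `num_tokens` tokens conditioned on `prefix_tokens`."""
    tokens = list(prefix_tokens)
    for _ in range(num_tokens):       # one forward pass per token
        tokens.append(0)              # placeholder for the sampled token id
    return tokens[len(prefix_tokens):]

def generate(mulan_tokens, seconds):
    # Stage 1: semantic modeling, prefixed by the MuLan tokens.
    semantic = sample_lm(mulan_tokens, 25 * seconds)
    # Stage 2: coarse acoustic modeling, conditioned on MuLan + semantic tokens.
    coarse = sample_lm(mulan_tokens + semantic, 200 * seconds)
    # Stage 3: fine acoustic modeling, conditioned on the coarse acoustic tokens.
    fine = sample_lm(coarse, 400 * seconds)
    return fine                       # decoded to waveform by the neural codec

forward_passes = (25 + 200 + 400) * 10   # 6250 sequential LM passes for a 10s clip
```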
3.1.1 Joint Tokenization of Music and Text with MuLan and RVQ

To maintain the merit of audio-only training, MusicLM relies on a joint audio-text embedding model, termed MuLan [46], which can be individually pre-trained with large-scale music data and weakly-associated, free-form text annotations. This MuLan model is learned to project a music audio and its corresponding text description into the same embedding space such that the paired audio-text embeddings can be as close as possible. In MusicLM, the embeddings of music and text are tokenized using a separately learned residual vector quantization (RVQ) [45] module. Then, to generate music from a text prompt, MusicLM takes the MuLan tokens from the RVQ as the conditioning tokens in the semantic modeling stage and the coarse acoustic modeling stage, following Eq. (1). Given the prefixing MuLan tokens, the semantic tokens, coarse acoustic tokens, and fine acoustic tokens can be subsequently computed by the LMs to generate music audio adhering to the text prompt.

4 Model Description

The overall training and sampling pipelines of MeLoDy are shown in Figure 1, where we have three modules for representation learning: (1) MuLan, (2) Wav2Vec2-Conformer, and (3) audio VAE, and two generative models: a language model (LM) and a dual-path diffusion (DPD) model, respectively, for semantic modeling and acoustic modeling. In the same spirit as MusicLM, we leverage an LM to model the semantic structure of music for its promising capability of modeling complex relationships across long-term contexts [9–11]. We also similarly pre-train a MuLan model to obtain the conditioning tokens.

[Figure 1: The training and sampling pipelines of MeLoDy.]

For semantic tokenization, after an empirical comparison against w2v-BERT [44], we employ a Wav2Vec2-Conformer model, which follows the same architecture as Wav2Vec2 [47] but employs Conformer blocks [48] in place of the Transformer blocks. The remainder of this section presents our newly proposed DPD model and the audio VAE-GAN used for the DPD model, while other modules overlapping with MusicLM are described in Appendix B regarding their training and implementation details.

4.1 Audio VAE-GANs for Latent Representation Learning

To avoid learning arbitrarily high-variance latent representations, Rombach et al. [2] examined a KL-regularized image autoencoder for latent diffusion models (LDMs) and demonstrated extraordinary stability in generating high-quality images [49], igniting a series of follow-up works [50]. Such an autoencoder imposes a KL penalty on the encoder outputs in a way similar to VAEs [51, 52], but, different from the classical VAEs, it is adversarially trained as in generative adversarial networks (GANs) [53]. In this paper, this class of autoencoders is referred to as the VAE-GAN. Although VAE-GANs have been applied promisingly in image generation, there are few comparable attempts in audio generation. In this work, we propose to use a similarly trained VAE-GAN for raw audio.
Specifically, the audio VAE-GAN is trained to reconstruct 24kHz audio with a striding factor of 96, resulting in a 250Hz latent sequence. The architecture of the decoder is the same as that in HiFi-GAN [54]. For the encoder, we basically replace the up-sampling modules in the decoder with convolution-based down-sampling modules, while other modules stay the same. For adversarial training, we use the multi-period discriminators in [54] and the multi-resolution spectrogram discriminators in [55]. The implementation and training details are further discussed in Appendix B.

4.2 Dual-Path Diffusion: Angle-Parameterized Continuous-Time Latent Diffusion Models

The proposed dual-path diffusion (DPD) model is a variant of diffusion probabilistic models (DPMs) [1, 15, 56] in continuous time [16, 57–59]. Instead of directly operating on the raw data space $x \sim p_{\text{data}}(x)$, with reference to LDMs [2], DPD operates on a low-dimensional latent space $z_0 = \mathcal{E}_\phi(x)$, such that the audio can be approximately reconstructed from the latent vectors: $x \approx \mathcal{D}_\phi(z_0)$, where $\mathcal{E}_\phi$ and $\mathcal{D}_\phi$ are the encoder and the decoder in the VAE-GAN, respectively. Diffusing the latent space can significantly relieve the computational burden of DPMs [2]. Also, sharing a similar observation with [2], we find that the audio VAE-GAN performed more stably than other VQ-based autoencoders [45, 60] when working with the outputs from diffusion models.

Formally speaking, DPD is a Gaussian diffusion process $z_t$ that is fully specified by two strictly positive, scalar-valued, continuously differentiable functions $\alpha_t, \sigma_t$ [16]: $q(z_t \mid z_0) = \mathcal{N}(z_t; \alpha_t z_0, \sigma_t^2 I)$ for any $t \in [0, 1]$. In light of [58], we define $\alpha_t := \cos(\pi t / 2)$ and $\sigma_t := \sin(\pi t / 2)$ to benefit from some nice trigonometric properties, i.e., $\alpha_t^2 + \sigma_t^2 = 1$ (a.k.a. variance-preserving [16]). With this definition, the forward diffusion process of $z_t$ can be re-parameterized in terms of an angle $\delta \in [0, \pi/2]$:

$$z_\delta = \cos(\delta)\, z_0 + \sin(\delta)\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \tag{2}$$

which implies that $z_\delta$ gets noisier as the angle $\delta$ increases from 0 to $\pi/2$. To create a generative process, a $\theta$-parameterized variational model $p_\theta(z_{\delta - \omega} \mid z_\delta)$ is trained to reverse the diffusion process by enabling taking any step $\omega \in (0, \delta]$ backward in angle.

[Figure 2: The proposed dual-path diffusion (DPD) model.]

By discretizing $\pi/2$ into $T$ finite segments, we can generate $z_0$ from $z_{\pi/2} \sim \mathcal{N}(0, I)$ in $T$ sampling steps:

$$p_\theta(z_0 \mid z_{\pi/2}) = \int \prod_{t=1}^{T} p_\theta(z_{\delta_t - \omega_t} \mid z_{\delta_t}) \, dz_{\delta_{1:T-1}}, \qquad \delta_t = \begin{cases} \sum_{i=t+1}^{T} \omega_i, & 1 \le t < T, \\ \pi/2, & t = T, \end{cases} \tag{3}$$

where $\omega_1, \ldots, \omega_T$, termed the angle schedule, satisfy $\sum_{t=1}^{T} \omega_t = \pi/2$. Regarding the choice of angle schedule, Schneider et al. [23] proposed a uniform one, i.e., $\omega_t = \frac{\pi}{2T}$ for all $t$. Yet, we observe that the noise schedules in previous works [61, 62] tend to take larger steps at the beginning of sampling followed by smaller steps for refinement. With a similar perspective, we design another, linear angle schedule, written as

$$\omega_t = \frac{\pi}{6T} + \frac{2\pi t}{3T(T+1)}, \tag{4}$$

which empirically gives more stable and higher-quality results. Appendix D presents the comparison results of this linear angle schedule against the uniform schedule used in [23].
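As a quick sanity check, the sketch below verifies the two properties claimed for the schedule, under the reconstruction of Eq. (4) given here (an illustration, not the authors' code): the T angles sum to π/2, and the step applied first during sampling (ω_T, since sampling runs from t = T down to t = 1) is larger than a uniform step, leaving smaller steps for the final refinement.

```python
import numpy as np

# Sanity check on the linear angle schedule of Eq. (4), as reconstructed above.

def linear_angle_schedule(T):
    t = np.arange(1, T + 1)
    return np.pi / (6 * T) + 2 * np.pi * t / (3 * T * (T + 1))   # Eq. (4)

T = 20
omega = linear_angle_schedule(T)
uniform_step = np.pi / (2 * T)                  # schedule used in [23]
assert np.isclose(omega.sum(), np.pi / 2)       # the T angles cover [0, pi/2]
assert omega[T - 1] > uniform_step              # first step taken (t = T) is larger than uniform
assert omega[0] < uniform_step                  # last step taken (t = 1) is smaller than uniform
```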
4.2.1 Diffusion Velocity Prediction

In DPD, we model the diffusion velocity at $\delta$ [58], defined as $v_\delta := \frac{dz_\delta}{d\delta}$. It can be simplified as

$$v_\delta = \frac{d\cos(\delta)}{d\delta}\, z_0 + \frac{d\sin(\delta)}{d\delta}\, \epsilon = \cos(\delta)\,\epsilon - \sin(\delta)\, z_0. \tag{5}$$

When $v_\delta$ is given, we can easily recover the original sample $z_0$ from a noisy latent $z_\delta$ at any $\delta$, since $z_0 = \cos(\delta)\, z_\delta - \sin(\delta)\, v_\delta$. This makes $v_\delta$ a feasible target for a neural network prediction $\hat{v}_\theta(z_\delta; c)$, where $c$ generally denotes the set of conditions controlling the music generation. In MeLoDy, as illustrated in Figure 1, the semantic tokens $u_1, \ldots, u_{T_{\text{ST}}}$, which are obtained from the SSL model during training and generated by the LM at inference time, are used to condition the DPD model. In our experiments, we find that the stability of generation can be significantly improved if we use token-based discrete conditions to control the semantics of the music and let the diffusion model learn the embedding vector for each token itself. As in [23, 58], this velocity prediction network can be effectively trained with a mean squared error (MSE) loss:

$$\mathcal{L}_\theta := \mathbb{E}_{z_0 \sim p_{\text{data}}(z_0),\, \epsilon \sim \mathcal{N}(0, I),\, \delta \sim \text{Uniform}[0, \pi/2]} \big\| \cos(\delta)\,\epsilon - \sin(\delta)\, z_0 - \hat{v}_\theta(\cos(\delta)\, z_0 + \sin(\delta)\,\epsilon;\; c) \big\|^2, \tag{6}$$

which forms the basis of DPD's training loss.

4.2.2 Multi-Chunk Velocity Prediction

With reference to [23], for long-context generation, we can incrementally append new chunks of random noise to continue audio generation indefinitely. To achieve this, the velocity prediction network needs to be trained to handle chunked input, where each chunk exhibits a different scale of noisiness. In particular, we define the multi-chunk velocity target $v_{\text{tgt}}$ that comprises $M$ chunks of velocities. Given $z_0, z_\delta, \epsilon \in \mathbb{R}^{L \times D}$, with $L$ representing the length of the latents and $D$ representing the latent dimension, we have $v_{\text{tgt}} := v_1 \oplus \cdots \oplus v_M$, where $\oplus$ is the concatenation operation and

$$v_m := \cos(\delta_m)\, \epsilon[L_{m-1}:L_m, :] - \sin(\delta_m)\, z_0[L_{m-1}:L_m, :], \qquad L_m := \left\lceil \frac{mL}{M} \right\rceil, \quad L_0 := 0. \tag{7}$$

Here, we use the NumPy slicing syntax (0 as the first index) to locate the m-th chunk, and we draw $\delta_m \sim \text{Uniform}[0, \pi/2]$ for each chunk at each training step to determine the noise scale. The MSE loss in Eq. (6) is then extended to

$$\mathcal{L}^{\text{multi}}_\theta := \mathbb{E}_{z_0, \epsilon, \delta_1, \ldots, \delta_M} \big\| v_{\text{tgt}} - \hat{v}_\theta(z_{\delta_1} \oplus \cdots \oplus z_{\delta_M};\; c) \big\|^2, \tag{8}$$

$$z_{\delta_m} := \cos(\delta_m)\, z_0[L_{m-1}:L_m, :] + \sin(\delta_m)\, \epsilon[L_{m-1}:L_m, :]. \tag{9}$$

Different from the original setting, where a global noise scale is used for the network input [1, 61, 63], in the case of multi-chunk prediction we need to specifically inform the network what the noise scales are for all $M$ chunks. Therefore, we append an angle vector $\delta$ to the set of conditions $c := \{u_1, \ldots, u_{T_{\text{ST}}}, \delta\}$ to record the angles drawn in all $M$ chunks, aligned with the $L$-length input:

$$\delta := [\delta_1]_{r=1}^{L_1} \oplus [\delta_2]_{r=1}^{L_2 - L_1} \oplus \cdots \oplus [\delta_M]_{r=1}^{L_M - L_{M-1}} \in \mathbb{R}^L, \tag{10}$$

where $[a]_{r=1}^{B}$ denotes the operation of repeating a scalar $a$ for $B$ times to make a $B$-length vector.
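The following numpy sketch illustrates, at the level of shapes only, how the multi-chunk quantities in Eqs. (7)–(10) fit together. It is an illustration under the chunk boundaries assumed above (M near-equal chunks with L_m = ⌈mL/M⌉), not the authors' training code.

```python
import numpy as np

# Shape-level illustration of the multi-chunk construction in Eqs. (7)-(10): each of the
# M chunks of the L x D latent gets its own angle delta_m, its own noisy latent (Eq. 9),
# and its own velocity target (Eq. 7), while the angle vector (Eq. 10) records the noise
# scale of every frame. The chunk boundaries L_m = ceil(mL/M) are an assumption.

L, D, M = 2500, 16, 4                                        # values used in Section 5.1
bounds = [int(np.ceil(m * L / M)) for m in range(M + 1)]     # L_0, ..., L_M

z0 = np.random.randn(L, D)                                   # clean latents from the audio VAE encoder
eps = np.random.randn(L, D)                                  # Gaussian noise
deltas = np.random.uniform(0.0, np.pi / 2, size=M)           # one angle per chunk

z_noisy = np.empty_like(z0)                                  # network input: concatenated noisy chunks
v_tgt = np.empty_like(z0)                                    # multi-chunk velocity target
delta_vec = np.empty(L)                                      # angle vector aligned with the L frames
for m in range(M):
    lo, hi = bounds[m], bounds[m + 1]
    c, s = np.cos(deltas[m]), np.sin(deltas[m])
    z_noisy[lo:hi] = c * z0[lo:hi] + s * eps[lo:hi]          # Eq. (9)
    v_tgt[lo:hi] = c * eps[lo:hi] - s * z0[lo:hi]            # Eq. (7)
    delta_vec[lo:hi] = deltas[m]                             # Eq. (10)

# The training loss of Eq. (8) is then || v_tgt - v_hat(z_noisy; tokens, delta_vec) ||^2.
```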
4.2.3 Dual-Path Modeling for Efficient and Effective Velocity Prediction

To predict the multi-chunk velocity with $\hat{v}_\theta$, we propose a dual-path modeling mechanism, which plays a prime role in DPD for efficient parallel processing along coarse and fine paths and effective semantic conditioning. Figure 2 presents the computation procedures of $\hat{v}_\theta$, which comprise several critical modules that we present one by one below. To begin with, we describe how the conditions $\{u_1, \ldots, u_{T_{\text{ST}}}, \delta\}$ are processed in DPD.

Encoding Angle Vector First, we encode $\delta \in \mathbb{R}^L$, which records the frame-level noise scales of the latents. Instead of using the classical positional encoding [1], we use a spherical interpolation [64] between two learnable vectors $e_{\text{start}}, e_{\text{end}} \in \mathbb{R}^{256}$ using broadcast multiplications, denoted by $\otimes$:

$$E_\delta := \text{MLP}^{(1)}\big(\cos(\delta) \otimes e_{\text{start}} + \sin(\delta) \otimes e_{\text{end}}\big) \in \mathbb{R}^{L \times D_{\text{hid}}}, \tag{11}$$

where, for all $i$, $\text{MLP}^{(i)}(x) := \text{RMSNorm}\big(\text{GELU}\big(x W^{(i)}_1\big) W^{(i)}_2 + b^{(i)}_2\big)$ projects an arbitrary input $x \in \mathbb{R}^{D_{\text{in}}}$ to $\mathbb{R}^{D_{\text{hid}}}$ using RMSNorm [65] and the GELU activation [66] with learnable $W^{(i)}_1 \in \mathbb{R}^{D_{\text{in}} \times D_{\text{hid}}}$, $W^{(i)}_2 \in \mathbb{R}^{D_{\text{hid}} \times D_{\text{hid}}}$, $b^{(i)}_2 \in \mathbb{R}^{D_{\text{hid}}}$, and $D_{\text{hid}}$ is the hidden dimension.

Encoding Semantic Tokens The remaining conditions are the discrete tokens representing semantic information, $u_1, \ldots, u_{T_{\text{ST}}}$. Following the typical approach for embedding natural languages [8], we directly use a lookup table of vectors to map any token $u_t \in \{1, \ldots, V_{\text{ST}}\}$ into a real-valued vector $E(u_t) \in \mathbb{R}^{D_{\text{hid}}}$, where $V_{\text{ST}}$ denotes the vocabulary size of the semantic tokens, i.e., the number of clusters in k-means for Wav2Vec2-Conformer. By stacking the vectors along the time axis and applying another MLP block, we obtain $E_{\text{ST}} := \text{MLP}^{(2)}([E(u_1), \ldots, E(u_{T_{\text{ST}}})]) \in \mathbb{R}^{T_{\text{ST}} \times D_{\text{hid}}}$.

Conditional on the computed embeddings $E_\delta$ and $E_{\text{ST}}$, we next show how the network input, i.e., $z_{\delta_t}$ in the case of a shared noise scale $\delta_t$ for all chunks, or $z_{\delta_1} \oplus \cdots \oplus z_{\delta_M}$ in the case of different noise scales, is processed in DPD for velocity prediction. For simplicity of notation, we use $z_{\delta_t}$ to denote the network input here and below. $z_{\delta_t}$ is first linearly transformed and added to the angle embedding of the same shape: $H := \text{RMSNorm}(z_{\delta_t} W_{\text{in}} + E_\delta)$, where $W_{\text{in}} \in \mathbb{R}^{D \times D_{\text{hid}}}$ is learnable. Then, a crucial segmentation operation is applied for dual-path modeling.

Segmentation As illustrated in Figure 3, the segmentation module divides a 2-D input into $S$ half-overlapping segments, each of length $K$, represented by a 3-D tensor $\bar{H} := [\mathbf{0}, \bar{H}_1, \ldots, \bar{H}_S, \mathbf{0}] \in \mathbb{R}^{S \times K \times D_{\text{hid}}}$, where $\bar{H}_s := H[(s-1)K/2 : (s+1)K/2, :]$, and $H$ is zero-padded such that we have $S = \lceil 2L/K \rceil + 1$.

[Figure 3: Diagram for visually understanding the segmentation operation for dual-path modeling.]

By choosing a segment size $K \ll L$, the cost of sequence processing along each path becomes sub-linear ($O(\sqrt{L})$) as opposed to $O(L)$. This greatly reduces the difficulty of learning a very long sequence and permits MeLoDy to use higher-frequency latents for better audio quality. In this work, 250Hz latent sequences were used. In comparison, MusicLM [5] was built upon a 50Hz codec.
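The sketch below illustrates this segmentation step. The exact padding arrangement is an assumption, but the segment count follows S = ⌈2L/K⌉ + 1 and matches the configuration of Section 5.1 (L = 2500, K = 64, S = 80).

```python
import numpy as np

# Half-overlapping segmentation used for dual-path modeling: an (L, D_hid) sequence is
# zero-padded and folded into S segments of length K with hop K/2, so the coarse path
# attends across segments while the fine path runs an RNN within each segment.
# Padding details are assumptions; the segment count matches Section 5.1.

def segment(H, K):
    L, D = H.shape
    hop = K // 2
    S = int(np.ceil(2 * L / K)) + 1                     # S = ceil(2L/K) + 1 segments
    padded = np.zeros(((S - 1) * hop + K, D))           # zero-pad the tail
    padded[:L] = H
    return np.stack([padded[s * hop: s * hop + K] for s in range(S)])   # (S, K, D)

H = np.random.randn(2500, 768)        # L = 2500 frames (10s at 250Hz), D_hid = 768
segments = segment(H, K=64)           # K = 64 frames = 256ms
print(segments.shape)                 # (80, 64, 768), i.e., S = 80 as in Section 5.1
```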
Dual-Path Blocks After the segmentation, we obtain a 3-D tensor input. As shown in Figure 2, this tensor is subsequently passed to $N$ dual-path blocks, where each block contains two processing stages corresponding to coarse-path (i.e., inter-segment) and fine-path (i.e., intra-segment) processing, respectively. Similar to the observations in [40, 41], we find it superior to use an attention-based network for coarse-path processing and a bi-directional RNN for fine-path processing. The goal of fine acoustic modeling is to better reconstruct the fine details from the roughly determined audio structure [3]. At a finer scope, only the nearby elements matter and contain most of the information needed for refinement, as supported by the modeling perspectives in neural vocoding [33, 35]. Specifically, we employ the RoFormer network [67] for coarse-path processing, where we use a self-attention layer followed by a cross-attention layer to condition on $E_{\text{ST}}$ with rotary positional embeddings. On the other hand, we use a stack of 2-layer simple recurrent units (SRUs) [68] for fine-path processing. A feature-wise linear modulation (FiLM) [69] layer is applied to the output of the SRUs to assist the denoising with the angle embedding $E_\delta$ and the pooled $E_{\text{ST}}$. The details of the inner mechanism of each dual-path block are presented in Appendix B.

4.2.4 Music Generation and Continuation

Suppose we have a well-trained multi-chunk velocity model $\hat{v}_\theta$; we begin with an $L$-length latent generation, where $L$ is the length of latents we used in training. According to Appendix A, the DDIM sampling algorithm [26] can be re-formulated by applying the trigonometric identities:

$$z_{\delta_t - \omega_t} = \cos(\omega_t)\, z_{\delta_t} - \sin(\omega_t)\, \hat{v}_\theta(z_{\delta_t}; c), \tag{12}$$

which, by running from $t = T$ to $t = 1$ using the $\omega_t$ defined in Eq. (4), generates a sample of $z_0$ of length $L$. To continue generation, we append a new chunk composed of random noise to the generated $z_0$ and drop the first chunk in $z_0$. Recall that the inputs to $\hat{v}_\theta$ are the $M$ concatenated noisy latents of different noise scales. The continuation of generation is feasible since the conditions (i.e., the semantic tokens and the angle vector) defined in DPD have an autoregressive nature at inference time. On the one hand, the semantic tokens are generated by the semantic LM in an autoregressive manner, therefore we can continue the generation of semantic tokens for the new chunk. On the other hand, since the multi-chunk model $\hat{v}_\theta$ is trained to tackle chunks of different noise scales with respect to the angle vector, we can simply skip denoising the already generated audio (in the first $M-1$ chunks) by zeroing the respective angles and assigning the current angle to the newly appended chunk, i.e., $\delta_{\text{new}} := [0]_{r=1}^{L - \lceil L/M \rceil} \oplus [\delta_t]_{r=1}^{\lceil L/M \rceil}$. Then, the newly appended noise chunk can be transformed into meaningful music audio after $\lceil T/M \rceil$ steps of DDIM sampling. For more details of generation, we present the corresponding algorithms in Appendix C. Besides music continuation, based on MuLan, MeLoDy also supports music prompts to generate music of a similar style, as shown in Figure 1. Examples of music continuation and music prompting are shown on our demo page.
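As a compact illustration of the sampling loop implied by Eqs. (3), (4) and (12), the sketch below rotates a noisy latent from angle π/2 down to 0 over T steps. It is a schematic, not the released implementation: `velocity_fn` is a placeholder for the trained DPD network, conditioning and the chunk-wise continuation are omitted, and the angle schedule follows the reconstruction of Eq. (4) given above.

```python
import numpy as np

# T-step DDIM sampling in angle space: starting from pure noise at angle pi/2, each
# step rotates the latent by omega_t using the predicted velocity (Eq. 12).

def sample(velocity_fn, shape, T=20, seed=0):
    t_idx = np.arange(1, T + 1)
    omega = np.pi / (6 * T) + 2 * np.pi * t_idx / (3 * T * (T + 1))   # Eq. (4), as reconstructed
    z = np.random.default_rng(seed).standard_normal(shape)            # z_{pi/2} ~ N(0, I)
    for t in range(T, 0, -1):                                          # run from t = T down to t = 1
        w = omega[t - 1]
        v_hat = velocity_fn(z)                                         # stand-in for v_hat_theta(z; c)
        z = np.cos(w) * z - np.sin(w) * v_hat                          # Eq. (12)
    return z                                                           # approximate z_0

z0 = sample(lambda z: np.zeros_like(z), shape=(2500, 16), T=20)        # 10s of 250Hz latents
```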
Table 2: The speed and the quality of our proposed MeLoDy on a CPU (Intel Xeon Platinum 8260 CPU @ 2.40GHz) or a GPU (NVIDIA Tesla V100) using different numbers of sampling steps.

| Steps (T) | Speed on CPU (↑) | Speed on GPU (↑) | FAD (↓) | MCC (↑) |
|---|---|---|---|---|
| (MusicCaps) | - | - | - | 0.43 |
| 5 | 1472Hz (0.06×) | 181.1kHz (7.5×) | 7.23 | 0.49 |
| 10 | 893Hz (0.04×) | 104.8kHz (4.4×) | 5.93 | 0.52 |
| 20 | 498Hz (0.02×) | 56.9kHz (2.4×) | 5.41 | 0.53 |

5 Experiments

5.1 Experimental Setup

Data Preparation As shown in Table 1, MeLoDy was trained on 257k hours of music data (6.4M 24kHz audios), which were filtered with [27] to focus on non-vocal music. Additionally, inspired by the text augmentation in [6], we enriched the tag-based texts to generate music captions by asking ChatGPT [70]. This music description pool is used for the training of our 195.3M MuLan, where we randomly paired each audio with either the generated caption or its respective tags. In this way, we robustly improve the model's capability of handling free-form text.

Semantic LM For semantic modeling, we trained a 429.5M LLaMA [71] with 24 layers, 8 heads, and 2048 hidden dimensions, which has a number of parameters comparable to that of MusicLM [5]. For conditioning, we set up the MuLan RVQ using 12 1024-sized codebooks, resulting in 12 prefixing tokens. The training targets were 10s semantic tokens, which are obtained by discretizing the 25Hz embeddings from a 199.5M Wav2Vec2-Conformer with 1024-center k-means.

Dual-Path Diffusion For the DPD model, we set the hidden dimension to $D_{\text{hid}} = 768$ and the number of blocks to $N = 8$, resulting in 296.6M parameters. For the input chunking strategy, we divide the 10s training inputs of a fixed length $L = 2500$ into $M = 4$ parts. For segmentation, we used a segment size of $K = 64$ (i.e., each segment is 256ms long), leading to $S = 80$ segments. In addition, we applied classifier-free guidance (CFG) [72] to DPD to improve the correspondence between samples and conditions. During training, the cross-attention to semantic tokens is randomly replaced by self-attention with a probability of 0.1. For sampling, the unconditional prediction $v_{\text{uncond}}$ and the conditional prediction $v_{\text{cond}}$ are linearly combined: $\lambda v_{\text{cond}} + (1 - \lambda) v_{\text{uncond}}$ with a guidance scale of $\lambda = 2.5$.
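The guidance combination above amounts to a single line at each sampling step; the snippet below spells it out, with placeholder arrays standing in for the two DPD predictions.

```python
import numpy as np

# Classifier-free guidance as used at sampling time: the conditional and unconditional
# velocity predictions are linearly combined with guidance scale lambda = 2.5, which is
# equivalent to v_uncond + lambda * (v_cond - v_uncond). The arrays below are placeholders
# for the two outputs of the DPD network at one sampling step.

lam = 2.5
v_cond = np.random.randn(2500, 16)     # prediction with cross-attention to semantic tokens
v_uncond = np.random.randn(2500, 16)   # prediction with the conditioning dropped
v_guided = lam * v_cond + (1 - lam) * v_uncond
assert np.allclose(v_guided, v_uncond + lam * (v_cond - v_uncond))
```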
Audio VAE-GAN For the audio VAE-GAN, we used a hop size of 96, resulting in 250Hz latent sequences for encoding 24kHz music audio. The latent dimension is $D = 16$; thus we have a total compression rate of 6×. The hidden channels used in the encoder were 256, whereas those used in the decoder were 768. The audio VAE-GAN in total contains 100.1M parameters.

5.2 Performance Analysis

Objective Metrics We use the VGGish-based [73] Fréchet audio distance (FAD) [74] between the generated audios and the reference audios from MusicCaps [5] as a rough measure of generation fidelity.² To measure text correlation, we use the MuLan cycle consistency (MCC) [5], which calculates the cosine similarity between text and audio embeddings using a pre-trained MuLan.³

Inference Speed We first evaluate the sampling efficiency of our proposed MeLoDy. As DPD permits using different numbers of sampling steps depending on our needs, we report its generation speed in Table 2. Surprisingly, MeLoDy steadily achieved a higher MCC score than that of the reference set, even when taking only 5 sampling steps. This means that (i) the MuLan model determined that our generated samples were more correlated to the MusicCaps captions than the reference audios, and (ii) the proposed DPD is capable of consistently completing the MuLan cycle at significantly lower costs than the nested LMs in [5].

²Note that, since MeLoDy was mainly trained with non-vocal music data, its sample distribution could not fit the reference one as well as in [5, 6], as about 76% of the audios in MusicCaps contain either vocals or speech.
³Since our MuLan model was trained with a different dataset, our MCC results cannot be compared to [5, 6].

Table 3: The comparison of MeLoDy with the SOTA text-to-music generation models. NFE is the number of function evaluations [58] for generating T-second audio.⁵ Musicality, Quality, and Text Corr. are the winning proportions in terms of musicality, quality, and text correlation, respectively.

| Model | NFE (↓) | Musicality (↑) MLM | Musicality (↑) N2M | Quality (↑) MLM | Quality (↑) N2M | Text Corr. (↑) MLM | Text Corr. (↑) N2M |
|---|---|---|---|---|---|---|---|
| MusicLM [5] | (25 + 200 + 400)T | 0.541 | - | 0.465 | - | 0.548 | - |
| Noise2Music [6] | 1000 + 800 + 800 | - | 0.555 | - | 0.436 | - | 0.572 |
| MeLoDy (20 steps) | 25T + 20 | 0.459 | 0.445 | 0.535 | 0.564 | 0.452 | 0.428 |

Comparisons with SOTA Models We evaluate the performance of MeLoDy by comparing it to MusicLM [5] and Noise2Music [6], which were both trained on large-scale music datasets and demonstrated SOTA results for a wide range of text prompts. To conduct fair comparisons, we used the same text prompts as in their demos (70 samples from MusicLM; 41 samples from Noise2Music),⁴ and asked seven music producers to select the better of a pair of samples or to vote for a tie (both win) in terms of musicality, audio quality, and text correlation. In total, we conducted 777 comparisons and collected 1,554 ratings. We detail the evaluation protocol in Appendix F. Table 3 shows the comparison results, where each category of ratings is separated into two columns, representing the comparison against MusicLM (MLM) or Noise2Music (N2M), respectively. Overall, MeLoDy consistently achieved comparable performance (all winning proportions fall into [0.4, 0.6]) in musicality and text correlation to MusicLM and Noise2Music. Regarding audio quality, MeLoDy outperformed MusicLM (p < 0.05) and Noise2Music (p < 0.01), where the p-values were calculated using the Wilcoxon signed-rank test. We note that, to sample 10s and 30s music, MeLoDy only takes 4.32% and 0.41% of the NFEs of MusicLM, and 10.4% and 29.6% of the NFEs of Noise2Music, respectively.

Diversity Analysis Diffusion models are distinguished for their high diversity [25]. We conduct an additional experiment to study the diversity and validity of MeLoDy's generation given the same text prompt with an open description, e.g., feelings or scenarios. The sampled results are shown on our demo page, in which we obtained samples with diverse combinations of instruments and textures.

Ablation Studies We also study ablations on two aspects of the proposed method. In Appendix D, we compare the uniform angle schedule in [23] and the linear one proposed in DPD using the MCC metric and case-by-case qualitative analysis. It turns out that our proposed schedule tends to induce fewer acoustic issues when taking a small number of sampling steps. In Appendix E, we show that the proposed dual-path architecture outperformed other architectures [23, 31] used for LDMs in terms of signal-to-noise ratio (SNR) improvements using a subset of the training data.

6 Discussion

Limitation We acknowledge the limitations of our proposed MeLoDy. To prevent any disruption caused by unnatural-sounding vocals, our training data was prepared to mostly contain non-vocal music only, which may limit the range of effective prompts for MeLoDy. Besides, the training corpus we used was unbalanced and slightly biased towards pop and classical music. Lastly, as we trained the LM and DPD on 10s segments, the dynamics of a long generation may be limited.

Broader Impact We believe our work has huge potential to grow into a music creation tool for music producers, content creators, or even normal users to seamlessly express their creative pursuits with a low entry barrier. MeLoDy also facilitates an interactive creation process, as in Midjourney [24], to take human feedback into account. For more precise tuning of MeLoDy on a musical style, the LoRA technique [75] can potentially be applied to MeLoDy, as in Stable Diffusion [49].

⁴All samples for evaluation are available at https://Efficient-MeLoDy.github.io/. Note that our samples were not cherry-picked, whereas the samples we compared against were cherry-picked [6], constituting very strong baselines.
⁵We use + to separate the counts for the iterative modules, i.e., LM or DPM. Supposing the cost of each module is comparable, the time steps taken by the LM and the diffusion steps taken by the DPM can be fairly compared.

References

[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[3] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. AudioLM: A language modeling approach to audio generation. arXiv preprint arXiv:2209.03143, 2022.
[4] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
[5] Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
[6] Qingqing Huang, Daniel S Park, Tao Wang, Timo I Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, et al. Noise2Music: Text-conditioned music generation with diffusion models. arXiv preprint arXiv:2302.03917, 2023.
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[10] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
[11] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[12] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. AudioGen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.
[13] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
[14] Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540, 2023.
[15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015.
[16] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
[17] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
[18] Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. FastDiff: A fast conditional diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934, 2022.
[19] Sungwon Kim, Heeseung Kim, and Sungroh Yoon. Guided-TTS 2: A diffusion model for high-quality adaptive text-to-speech with untranscribed data. arXiv preprint arXiv:2205.15370, 2022.
[20] Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023.
[21] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023.
[22] Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023.
[23] Flavio Schneider, Zhijing Jin, and Bernhard Schölkopf. Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023.
[24] David Holz et al. Midjourney. Artificial intelligence platform. Accessible at https://www.midjourney.com/. Accessed November 1st, 2023.
[25] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. arXiv preprint arXiv:2112.07804, 2021.
[26] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[27] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020.
[28] Antoine Caillon and Philippe Esling. RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arXiv preprint arXiv:2111.05011, 2021.
[29] Marco Pasini and Jan Schlüter. Musika! Fast infinite waveform music generation. arXiv preprint arXiv:2208.08706, 2022.
[30] Mubert Inc. Mubert. URL https://mubert.com/, 2023.
[31] S Forsgren and H Martiros. Riffusion - stable diffusion for real-time music generation. URL https://riffusion.com/, 2023.
[32] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
[33] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[34] Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3918–3926. PMLR, 2018.
[35] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference on Machine Learning, pages 2410–2419. PMLR, 2018.
[36] Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, and Marco Tagliasacchi. SoundStorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636, 2023.
[37] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.
[38] Yi Luo, Zhuo Chen, and Takuya Yoshioka. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 46–50. IEEE, 2020.
[39] Jingjing Chen, Qirong Mao, and Dong Liu. Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation. arXiv preprint arXiv:2007.13975, 2020.
[40] Max WY Lam, Jun Wang, Dan Su, and Dong Yu. Effective low-cost time-domain audio separation using globally attentive locally recurrent networks. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 801–808. IEEE, 2021.
[41] Max WY Lam, Jun Wang, Dan Su, and Dong Yu. Sandglasset: A light multi-granularity self-attentive network for time-domain speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5759–5763. IEEE, 2021.
[42] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. Attention is all you need in speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 21–25. IEEE, 2021.
[43] Shengkui Zhao and Bin Ma. MossFormer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[44] Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE, 2021.
[45] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
[46] Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, and Daniel PW Ellis. MuLan: A joint embedding of music audio and natural language. arXiv preprint arXiv:2208.12415, 2022.
[47] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
[48] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
[49] Robin Rombach and Patrick Esser. Stable Diffusion v2-1. URL https://huggingface.co/stabilityai/stable-diffusion-2-1, 2023.
[50] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[51] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[52] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR, 2014.
[53] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
[54] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020.
[55] Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. arXiv preprint arXiv:2106.07889, 2021.
[56] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[57] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[58] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
[59] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
[60] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[61] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2020.
[62] Max WY Lam, Jun Wang, Dan Su, and Dong Yu. BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis. In International Conference on Learning Representations, 2022.
[63] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
[64] Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pages 245–254, 1985.
[65] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
[66] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[67] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
[68] Tao Lei, Yu Zhang, Sida I Wang, Hui Dai, and Yoav Artzi. Simple recurrent units for highly parallelizable recurrence. arXiv preprint arXiv:1709.02755, 2017.
[69] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[70] OpenAI. ChatGPT. URL https://chat.openai.com/, 2023.
[71] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[72] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[73] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
[74] Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In INTERSPEECH, pages 2350–2354, 2019.
[75] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.