# diversityrewarded_cfg_distillation__4d97ae1c.pdf

Published as a conference paper at ICLR 2025

DIVERSITY-REWARDED CFG DISTILLATION

Geoffrey Cideron , Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie Olivier Bachem, Sarah Perrin*, Alexandre Ram e*

Google Deep Mind, * Equal advisory contribution

Generative models are transforming creative domains such as music generation, with inference-time strategies like Classiﬁer-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel ﬁnetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By ﬁnetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the ﬁrst focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the Music LM text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our ﬁnetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations at google-research.github.io/seanet/musiclm/diverse music.

1 INTRODUCTION

Generative models for creative domains. Art and entertainment domains historically driven by human creativity are undergoing a profound transformation thanks to AI generative models. These models, often powered by Large Language Models (LLMs) or diffusion, can now generate texts (Gemini Team, 2023), images (Ramesh et al., 2022), videos (Ho et al., 2022), and audios (Borsos et al., 2023; Kreuk et al., 2023; Copet et al., 2023; Agostinelli et al., 2023; Cideron et al., 2024; D efossez et al.). To further reﬁne the quality, these models are often augmented with inference methods during real-world deployment, ranging from simple temperature scaling to more reﬁned methods like Beam search (Freitag & Al-Onaizan, 2017), test-time augmentation (Shanmugam et al., 2021), or MCTS (Kocsis & Szepesv ari, 2006). A particularly popular method for image and audio generation is classiﬁer-free guidance (CFG) (Ho & Salimans, 2022), e.g. used in DALL-E (Ramesh et al., 2022) or Audio Gen (Kreuk et al., 2023). CFG improves the model s ﬁdelity to the prompt by combining the logits of conditional and unconditional generations. Despite its beneﬁts, CFG has two main limitations: it doubles the computational cost during deployment and reduces the diversity of generated content (Ho & Salimans, 2022; Dhariwal & Nichol, 2021; Kreuk et al., 2023; Meng et al., 2023), hindering the exploration of novel and diverse ideas a cornerstone of creativity. Ideally, these models should not only fulﬁll user intent but also surprise them with unexpected and innovative outputs: the model should not generate systematically the same content (Hamilton, 2024).

Quality-diversity trade-off. Effectively controlling the quality-diversity trade-off is thus extremely important but challenging. On the one hand, optimising quality usually reduces diversity; this limitation affects inference methods such as CFG but also ﬁnetuning stages such as RLHF (Kirk et al., 2024; Mohammadi, 2024). On the other hand, promoting diversity usually reduces quality (Brown

Correspondence to: Geoffrey Cideron <gcideron@google.com>

Published as a conference paper at ICLR 2025

3.8 4.0 4.2 4.4 4.6 Quality

LERP(0, 15) CFG

Figure 1: Left. Illustration of the two objectives: CFG distillation (above) and the diversity reward (below), multiplied by the diversity coefﬁcient β in the joint ﬁnetuning objective. Right. Qualitydiversity trade-off for different strategies. The ﬁrst four lines represent the training trajectories of our approach, distilling CFG (with γ = 3) with varying diversity coefﬁcient β in {0, 5, 10, 15}; every 500 training steps, we evaluate the quality and diversity of the generations. Larger values of β lead to more diverse models yet slightly less quality. For linear interpolation (LERP), each cross corresponds to a 0 λ 1 when interpolating between the weights θq of a quality-focused model (β = 0) and those θd of a diversity-focused model (β = 15); the evaluated generations are obtained from the weights (1 λ) θq + λ θd. For the CFG baseline, each dot corresponds to a different value for the guidance factor 1 γ 7. This plot shows that our method improves the quality-diversity trade-off; notably, LERP uncovers a strong and steerable front of solutions by just interpolating between the weights of two models, at deployment time.

et al., 2005), e.g. when increasing temperature at inference (Zhang et al., 2021). As further discussed in Section 4, quality-diversity algorithms (Lehman & Stanley, 2011; Mouret & Clune, 2015) seek to train a population of models with diverse abilities. In contrast, optimising directly the diversity of generations of a single model is an under-explored yet promising avenue, that we investigate here.

Diversity-rewarded CFG distillation. In this work, we introduce a novel ﬁnetuning strategy to enhance the quality-diversity trade-off in generative models for creative domains. Speciﬁcally, we combine distillation and reinforcement learning (RL) to optimise two complementary objectives. The ﬁrst is a novel CFG distillation objective for LLMs where we distill the behavior of a teacher into the student. Critically, the teacher is the CFG-augmented base model, rather than a larger third-party model as commonly done in the literature (Hinton et al., 2015). This involves minimizing the KL divergence between the logits of the teacher and the student on the data distribution generated by the student (to reduce train-test mismatch), following the on-policy distillation framework of Agarwal et al. (2024). By distilling CFG into the model weights, we improve generation quality while eliminating CFG s inference overhead. The second is a novel RL with a diversity reward objective, maximising the diversity across pairs of generations for a given prompt. Diversity is measured with by comparing pairs of generations by ﬁrst embedding them and then computing their negative (cosine) similarity in this embedding space. Combining these two objectives allows the ﬁnetuned model to inherit the quality of CFG without its cost, while simultaneously maintaining diversity through the RL objective, thus solving the two main drawbacks of CFG at inference time.

Model merging. The hyperparameter β (multiplier for the diversity reward) allows controlling the quality-diversity trade-off at training time, as shown in Figure 1 (right); in contrast, traditional CFG can do it at deployment time. To enable this level of control within our approach, we propose a third contribution involving model merging. Speciﬁcally, we ﬁnetune two models, one focusing on quality (low β) and the other focusing on diversity (high β), and then combine their weights by linear interpolation (LERP) (Utans, 1996), balancing between quality and diversity. This follows from the linear mode connectivity property (Frankle et al., 2020) and recent advances in model merging (Wortsman et al., 2022a; Ram e et al., 2022; Ilharco et al., 2023), showing that weights ﬁnetuned from a shared pretrained initialisation can be interpolated, despite the non-linearities in the architecture.

Published as a conference paper at ICLR 2025

Notably, Ram e et al. (2023) showed that interpolating between weights ﬁnetuned on different rewards trades their abilities off. Thus, we interpolate between the weights of a quality-focused model and a diversity-focused model: sliding the interpolating coefﬁcient λ between 0 and 1 uncovers a strong and steerable quality-diversity front of solutions, without overhead at deployment. Crucially, we ﬁnd that the interpolated model using λ = 0.5 is our best model, outperforming models explicitly ﬁnetuned for intermediate values of the diversity hyperparameter β.

Contribution 1: CFG distillation for quality in LLMs. We distill the quality beneﬁts of a costly inference-time strategy, CFG, into the weights of our model.

Contribution 2: Reinforcement learning for diversity. We reward the model to create diverse generations, reducing the drop in diversity caused by CFG and ﬁnetuning.

Contribution 3: Model merging for Pareto-optimality. We trade-off quality and diversity at deployment time by interpolating between the weights of a quality-focused model and a diversity-focused model.

Music generation. We apply our strategy to text-to-music generation, a creative task where balancing quality and diversity is key. Speciﬁcally, we ﬁnetune Music LM (Agostinelli et al., 2023) and consistently improve the quality-diversity trade-off achieved by the CFG-augmented previous stateof-the-art (Cideron et al., 2024). Our experiments, featuring human evaluations, validate that our models generate more diverse music while maintaining high quality.

2 DIVERSITY-REWARDED CFG DISTILLATION

Notations. Let mθ denote an auto-regressive model parameterised by θ. Given an input sequence x from the space of sequences X, the model mθ deﬁnes a policy πθ by sequential sampling of an output sequence y = (s1, ..., s L) of length L. Speciﬁcally, given a partial sequence y<n = (s1, ..., sn 1), the policy πθ samples the next token sn with temperature T in the softmax such as πθ(sn|y<n, x) exp(zn/T) where zn = mθ(sn|y<n, x) is the corresponding logit. We also deﬁne pθ(y|x) = QL n=1 πθ(sn|y<n, x) the probability of sampling the output sequence y from input x.

2.1 CFG DISTILLATION FOR QUALITY

CFG. Given a model mθ and a partial sequence y<n, CFG is an inference-time strategy that samples the next token by combining conditional logits mθ(sn|y<n, x) (where x is the user s prompt) and unconditional/negative logits mθ(sn|y<n, x ) (where x is an empty or negative prompt):

z CF Gγ n = γ mθ(sn|y<n, x) + (1 γ) mθ(sn|y<n, x ). (1)

The next-token probability distribution of the CFG-augmented policy is then πCF Gγ θ (sn|y<n, x) exp(z CF Gγ n /T). The guidance factor γ controls the adherence to the prompts; higher values typically lead to higher quality (Ho & Salimans, 2022). We use γ > 1, extrapolating rather than interpolating, as done in Kreuk et al. (2023). For example, if x is Soulful jazz song. and x is Bad audio quality. , a larger γ increases the degree to which the generated sequence resembles typical elements of a soulful jazz song while increasing audio quality.

CFG distillation. One key limitation is that CFG doubles the inference cost, since it requires the computation of two sets of logits. To eliminate this overhead, we propose an objective that directly distills the beneﬁts of CFG into the model weights taking inspiration from similar approaches in diffusion (Luo et al., 2023; Yin et al., 2024; Saito et al., 2024; Bai et al., 2023; Novack et al., 2024). Knowledge distillation (Hinton et al., 2015) traditionally involves compressing a large teacher model into a smaller student model (Sanh et al., 2019; Agarwal et al., 2024), allowing for efﬁcient deployment while approximating the teacher s performance. In contrast, our teacher and student share the same architecture: the teacher (with weights θinit from initialisation) is simply augmented with CFG. We then follow the standard distillation approach that encourages the student to match logits from the teacher.

Published as a conference paper at ICLR 2025

On-policy distillation. If the teacher provides the label, the remaining question is: how to sample the input data? i.e. the completions on which teacher s and student s logits should match. Be given a set of prompts, ofﬂine distillation uses inputs sampled from the teacher (Lin et al., 2020); in contrast, we adopt the on-policy distillation strategy from Agarwal et al. (2024), where data is online and dynamically sampled from the student itself. This reduces the train-test mismatch, a.k.a the exposure bias (Bengio et al., 2015), and was used in recent state-of-the-art LLMs (Gemma Team et al., 2024)1.

CFG distillation objective. Overall, starting from the base pretrained initialisation θinit, we distill the logits obtained by θinit with CFG into the weights θ without CFG by maximising:

Q(θ) = Ex X

n=1 KL πθ ( |y<n, x) ||πCF Gγ θinit ( |y<n, x) ##

2.2 RL FOR DIVERSITY

Reduced diversity during alignment. Another key drawback of CFG is that it reduces the diversity of generated outputs as γ increases. This occurs because stronger guidance encourages the model to adhere closely to the prompt, leading to outputs that are more similar to each other. This trend also emerges in our CFG distillation procedure (from Section 2.1). We conﬁrm in Figure 2 (right) that diversity across generations from the student steadily decreases as training progresses. More generally, policy collapse is a recurring challenge for alignment strategies; for instance, Kirk et al. (2024) demonstrate that RLHF signiﬁcantly reduces diversity of outputs. This limits exploration of novel and diverse contents, a crucial aspect for creative domains. To address this, we introduce a strategy encouraging the model to generate diverse samples. This requires two components, detailed below: (1) a diversity reward and (2) a diversity-enforcing algorithm.

Diversity reward: negative cosine similarity of embeddings. We ﬁrst introduce a simple method to quantify the diversity between two generations y1 and y2. We embed those generations with a model E and compute their cosine similarity in the embedding space. Then, the diversity is deﬁned as r D(y1, y2) = 1 E(y1) E(y2) E(y1) E(y2) . The key component of this diversity reward is thus the embedding model E. In our text-to-music application from Section 3, drawing inspiration from Futeral et al. (2024), we train a self supervised contrastive model using the semi-hard triplet loss (Schroff et al., 2015) with positive pairs coming from non-overlapping ﬁxed-length chunks of the same audio segment. The model induces an embedding space where segments originating from the same longer audio are mapped to nearby points, while segments from different recordings are mapped further apart. Following Futeral et al. (2024), our embedding model takes as input a 4-seconds audio clip sampled at 16k Hz. The model architecture is based on a compact Vision Transformer (Vi T) (Dosovitskiy et al., 2021). More experimental details can be found in Appendix D.1.

Diversity algorithm: RL for diversity. We now leverage reinforcement learning (RL) to encourage the student policy to generate diverse samples for a given prompt. Using the diversity reward deﬁned above, the diversity objective to maximise is: D(θ) = Ex X Ey1,y2 pθ( |x) [r D(y1, y2)] . (3) As shown in Appendix A, an unbiased estimator of D(θ) is then: r D(y1, y2) log pθ(y1|x) with x X, y1, y2 pθ( |x). (4) Due to the similarity of Equation (4) with policy gradient (Sutton et al., 1999), this diversity can be optimised with standard algorithms such as REINFORCE (Williams, 1992), updating the parameters θ by increasing the likelihood of generations that lead to high (diversity) rewards. Overall, the only difference is that this gradient asks for multiple generations, matching the requirements of multi-sample RL algorithms recently used in RLHF (Ahmadian et al., 2024; Sessa et al., 2024).

2.3 DIVERSITY-REWARDED CFG DISTILLATION

Combining both CFG distillation from Section 2.1 and the diversity reward from Section 2.2, we maximise the following novel diversity-rewarded CFG distillation objective: QD(θ) = Q(θ) + β D(θ), (5)

1Moreover, as we also encourage the generations from the student to optimise a diversity reward, samples from the student will be readily available, making this on-policy distillation possible at no cost.

Published as a conference paper at ICLR 2025

with β the diversity hyperparameter scaling the RL diversity term to the KL quality distillation term. Low values of β (e.g. 0) lead to policies focused on quality, while large values of β (e.g. 15) lead to policies prioritising diversity at a slight cost in terms of quality, as visible in Figure 1 (right).

2.4 MODEL MERGING FOR PARETO-OPTIMAL QUALITY-DIVERSITY

Tuning the hyperparameter β allows for some control over the quality-diversity trade-off before training, but exploring the full spectrum of possibilities would require maintaining a large set of models with different β values. To address this, we leverage model merging, which enables combining the strengths of different models by simply interpolating their weights (Ilharco et al., 2023; Dimitriadis et al., 2023; Ram e et al., 2023). This allows us to efﬁciently adjust the quality-diversity trade-off at deployment time, based on the speciﬁc needs of the user or application, with only two ﬁnetunings. Speciﬁcally, we consider two models ﬁnetuned from a shared pretrained initialisation with different values of β, θq focused on high quality (β = 0) and θd focused on high diversity (high β). We then trade-off between their abilities simply by interpolating between their weights: θLERP = (1 λ) θq + λ θd, (6) with λ [0, 1]. For instance, setting λ close to 0 favors the high-quality model, generating outputs that closely adhere to the prompt, while λ close to 1 favors the high-diversity model, resulting in more exploratory outputs. This lightweight approach uncovers a strong and steerable quality-diversity front of solutions with minimal computational cost, similar to adjusting γ in CFG during inference.

3 EXPERIMENTS

The experiments below are structured to address the following research questions. (1) CFG distillation: can we distill the quality of CFG into the weights of the model, and at what cost in terms of diversity? (2) Diversity RL: can we trade-off between quality and diversity by including a diversity reward during ﬁnetuning? (3) Model merging: can we construct at inference time a strong and steerable quality-diversity front of solutions with weight interpolation?

3.1 TEXT-TO-MUSIC GENERATION: TASK, SETUP AND METRICS

Task. We explore those questions on text-to-music generation, a creative task where quality and diversity are important factors. Indeed, a single music prompt could map to many different valid generations. For instance, a gentle bossa nova piece, a relaxing acoustic guitar melody or an ethereal electronic soundscape would all be valid responses to the prompt A peaceful sunset on a beach . This makes text-to-music generation a great testbed to study quality-diversity trade-offs.

Model and setup. We follow the experimental setup from Cideron et al. (2024), using their LLM transformer-based architecture, where a single autoregressive stage models the semantic and coarse acoustic features. We use their Music RL-R checkpoint as the base model θinit, initialising all our experiments. We use the prompt dataset described in Section 4.1 from Cideron et al. (2024), combining multiple sources: synthetic prompts generated from the La MDA model (Thoppilan et al., 2022), prompts collected via user interactions (Cideron et al., 2024), and prompts derived from Music Caps (Agostinelli et al., 2023). For CFG, we set γ = 3 and use the negative prompt Bad audio quality. , as it performed best on the base model (cf. Appendix B). For the diversity reward, we train a music embedding model E by contrastive self-supervised learning (cf. Appendix D.1).

Training details. The partial generations for distillation are sampled from the student in an online fashion (Agarwal et al., 2024) with temperature T = 0.99. The RL algorithm is a variant of REINFORCE (Williams, 1992) with a baseline for variance reduction. We use a batch size of 128 and a learning rate of 0.00015 for all our ﬁnetunings.

Metrics. We evaluate the quality of the generations with two metrics: (1) the Mu Lan score (Agostinelli et al., 2023), a text adherence metric based on the Mu Lan embeddings (Huang et al., 2022), and (2) the User Preference score (Cideron et al., 2024), learned on 300k pairwise preferences of music generations to approximate the average human opinion score. We then deﬁne the general quality score as squality = ω s Mu Lan + s User Preference, using ω = 5 to enforce a similar range of variations for the two scores. Critically, those metrics are not used during training. In Section 3.5, we conﬁrm the insights obtained from this quality metric through human evaluation.

Published as a conference paper at ICLR 2025

3.2 CFG DISTILLATION FOR QUALITY

We want to validate whether or not it is possible to match by distillation (at training time) the performances of CFG-augmentation (at inference time) without its inference cost overhead.

CFG distillation optimisation. Figure 2 (left) shows the evolution of the KL divergence from Equation (2) between the policy and the CFG-augmented policy along training. When the objective is only to minimise the KL (i.e. β = 0), the KL decreases to its lowest value, implying that the CFG-free model learns to approximate the CFG logits. β is scheduled to linearly increase until step 1000, which explains why the KL increases after this step for β > 0.

CFG distillation improves quality. Figure 2 (middle) shows that training signiﬁcantly improves quality matching the performances of the base model augmented with CFG and γ = 3. This suggests that CFG s quality can be effectively distilled into the model s weights, a ﬁnding conﬁrmed by human evaluations in Section 3.5.

CFG distillation reduces diversity. Figure 2 (right) shows that the CFG-distilled model (β = 0) suffers from a decrease in diversity compared to its initialisation, the CFG-free base model. Notably, after only 1k training steps, CFG-distillation reaches the same level of diversity (around 0.37) than the CFG-augmented base model (grey horizontal line for γ = 3). As a side note, Figure 1 (right), Figure 4 (right) and Figure 6 (in Appendix B) show that higher values of γ for CFG further reduce diversity. In the next subsection, we assess the efﬁciency of our diversity reward to promote diversity.

0 1000 2000 3000 4000 5000 # steps

0 1000 2000 3000 4000 5000 # steps

0 1000 2000 3000 4000 5000 # steps

CFG upper bound

Figure 2: Left. Evolution of the KL divergence between the CFG-distilled student and the CFGaugmented teacher along training. GKD distillation alone (β = 0) decreases the KL between the two policies. Middle. Evolution of the quality along training, showing improved quality for all selected values of β. Right. Evolution of the diversity across generations along training, showing that CFG distillation alone reduces diversity, but that using a diversity reward (β = 0) can actually increase it. The CFG line shows the quality/diversity performance of the CFG-augmented base model serving as a teacher. The upper-bound line indicates the mean diversity of two generations (from the base model) for two different prompts.

3.3 RL FOR DIVERSITY

Rewarding diversity. We now analyse the results obtained when including the diversity reward from Equation (3), scaled by the hyperparameter β {0, 5, 10, 15} in the joint objective. As visible in Figure 2 (right), this increases the diversity across generations along training. For β = 10 or β = 15, diversity actually gets larger, and closer to the mean diversity between two generations for two different prompts, denoted as an empirical upper-bound .

Quality-diversity trade-off. Figure 2 (left) shows that these diversity gains come at the cost of increased KL. This is because higher values of β put more emphasis on the diversity term from Equation (3), hindering the minimisation of the KL distillation loss from Equation (2). As a consequence, policies ﬁnetuned with larger values of β have lower quality in Figure 2 (middle). Those insights are summarised in Figure 1 (right), where the quality-diversity trade-off is plotted along training for the different values of β.

Published as a conference paper at ICLR 2025

3.4 MODEL MERGING FOR PARETO-OPTIMAL QUALITY-DIVERSITY

3.6 3.8 4.0 4.2 4.4 4.6 Quality

LERP(0,15) LERP(0,10) LERP(0,5) LERP(0,base) LERP(0,5,10,15) uniform CFG on base model CFG on CFG-distilled model Temperature on base model Temperature on CFG-distilled model

Figure 3: Quality-diversity trade-off for multiple strategies. The ﬁrst four lines linear interpolate (LERP) between the quality-focused model (β = 0) and more diverse models (those trained with β > 0, or the base model), sliding λ between 0 and 1 with a step of 0.05. We also report the performance from the uniform (λ = 1

4) averaging of the four models ﬁnetuned with different β, denoted as LERP(0, 5, 10, 15) uniform . We include inference-time baseline strategies CFG (when varying γ) and temperature sampling (when varying the temperature T) applied either on the base model or on the CFG-distilled model.

Quality-diversity trade-off. In Figure 3 we interpolate between the model with highest quality and the model with highest diversity: by sliding the coefﬁcient λ from 0 (only the quality model) to 1 (only the diversity model), we construct fronts of solutions that surpass the performance of individual models ﬁnetuned for intermediate β values. In particular, interpolating towards the solution with maximum diversity (β = 15) yields a strong and steerable Pareto front describing the full range of possible trade-offs. This indicates that model merging achieves a higher quality for a given level of diversity, and vice versa, consistent with previous studies that show how merged models perform better than their non-merged counterparts (Wortsman et al., 2022a; Ram e et al., 2022). This is achieved with minimal overhead: only two RL ﬁnetuning runs are required, and merging the weights involves negligible computational cost. If only one ﬁnetuning is possible, then linearly interpolating towards the base initialisation can also help, by recovering features from pretraining (Wortsman et al., 2022b); in particular we note that LERP(0, base) outperforms the CFG-augmented base model. If more compute is available for training and we can do four ﬁnetunings, then merging them uniformly performs even better, as visible in Figure 3 where the grey cross for LERP(0, 5, 10, 15) uniform is (slightly) above the other fronts. This suggests that we can merge an arbitrary number of models.

0.0 0.2 0.4 0.6 0.8 1.0 3.6

0.0 0.2 0.4 0.6 0.8 1.0 0.30

1 2 3 4 5 6 7 3.6

1 2 3 4 5 6 7 0.30

Figure 4: Left. Linear interpolation between the weights of a model focused on quality (β = 0) and a model focused on diversity (β = 15), sliding the interpolating coefﬁcient λ between 0 and 1. The dashed diagonal represents the expected values if abilities were traded-off linearly between those two models. While the diversity stays close to the diagonal, the quality remains above it, showing the beneﬁts of model merging. Right. For comparison, we also include the results for CFG when sliding γ between 1 and 7, performing worse than merged models.

Published as a conference paper at ICLR 2025

Baselines. Figure 3 also displays the results for two inference-time baselines: CFG and temperature sampling. Applied to the base model, these strategies are Pareto-dominated by CFG-distilled models with diversity rewards. When applied to the already CFG-distilled model, they do not signiﬁcantly improve quality nor diversity.

Quality-diversity as a function of λ. Figure 4 (left) clariﬁes the effect of weight interpolation on quality and diversity as we slide λ between 0 and 1. Diversity stays close to the diagonal (representing the expected diversity) while the quality is consistently above the diagonal. This highlights the ability to signiﬁcantly increase diversity with only minor quality compromises via model merging.

3.5 HUMAN EVALUATION

Protocol. To conclude our experiments, we validate via human evaluation the improved qualitydiversity trade-offs. We evaluate ﬁve models: the base model (we expect low quality but high diversity), the base model with CFG γ = 3 (high quality but low diversity), the CFG-distilled model for β = 0 (high quality but low diversity) and β = 15 (medium quality and high diversity), and ﬁnally LERP(0, 15) merging uniformly with λ = 0.5 the two previous models (high quality and medium diversity). For quality (i.e. acoustic quality, text adherence, and musicality), we use the same evaluation protocol as in Cideron et al. (2024): the raters see two generations from different models coming from the same text prompt and rate them on a scale from 1 to 5. We use 100 different prompts and each one is seen by 3 different raters. For diversity, we rely on a similar protocol except that the raters see two pairs of music clips generated from the same text prompts and are rate how diverse the pairs of generations are. We use 50 different prompts and each one is rated by 3 different raters.

LERP(0, 15)

LERP(0, 15)

0.42 0.34 0.32 0.28

0.58 0.45 0.41 0.39

0.66 0.55 0.50 0.48

0.68 0.59 0.50 0.49

0.72 0.61 0.52 0.51

LERP(0, 15)

LERP(0, 15)

0.50 0.81 0.78 0.72

0.50 0.79 0.73 0.68

0.19 0.21 0.48 0.42

0.22 0.27 0.52 0.43

0.28 0.32 0.58 0.57

Figure 5: Left. Side-by-side human evaluation for quality. Right. Side-by-side human evaluation for diversity. The score corresponds to the win rate of model A over model B, computed as W +T/2

W +L+T with W the number of wins of A over B, T the number of ties, L the number of losses of A against B. This conﬁrms that our approach improves the quality-diversity trade-off. For instance, the merged model LERP(0, 15) generates music with higher diversity than the CFG-augmented base model (γ = 3) in 57% of the comparisons, while being rated as more qualitative half of the time (51%).

Human evaluation for quality. Figure 5 (left) presents the side-by-side win rate for quality. The CFG-distilled model (i.e. β = 0) performs on par with the CFG-augmented base model (i.e. CFG) with a win rate of 0.50; they both outperform the base model without CFG (ﬁrst line), with 0.66 and 0.68 win rate respectively. LERP(0, 15) achieves win rate above 0.50 against all models.

Human evaluation for diversity. Figure 5 (right) presents the side-by-side win rate for diversity. The CFG-augmented base model (i.e. CFG) and the CFG-distilled model (i.e. β = 0) are also evaluated similarly in terms of diversity (0.52 win rate for the former); yet, they are consistently seen less diverse than the three other models. Notably, the CFG-distilled model with diversity reward (i.e. β = 15) has win rate of 0.73 and 0.79 against them. As a side note, this β = 15 model performs on par with the base model (0.50 win rate), though the quantitative metrics from Figure 2 (right) suggested that it was actually more diverse; in Section 5, we relate this discrepancy to the standard reward hacking (Amodei et al., 2016; Gao et al., 2023) phenomenon of RL.

Published as a conference paper at ICLR 2025

Human evaluation for quality-diversity trade-off. Human evaluations conﬁrm the effectiveness of our approach in improving the quality-diversity trade-off. The diversity-focused model (β = 15) exhibits higher quality than the base model while maintaining comparable diversity. Importantly, the merged model LERP(0, 15) exhibits higher diversity than the CFG-augmented base model while maintaining comparable quality, and without incurring the additional inference cost of CFG.

Qualitative analysis. To facilitate qualitative analysis, we host music generated from all evaluated models on the website google-research.github.io/seanet/musiclm/diverse music. An examination of generic prompts like Rock song. conﬁrms that CFG improves quality (i.e. less acoustic artifacts, better musicality) but reduces diversity, as generated guitars tend to be very similar across different generations. In contrast to the CFG-augmented model, the β = 15 model generates a wider variety of rhythms while still demonstrating a clear improvement in quality over the base model. On prompts like Opera singer or Male choir harmony , the quality-focused model β = 0 generates conventional outputs, while the diverse model β = 15 produces more unconventional and creative results, sometimes leading to unexpected elements like unusual instrumentation (e.g. drums). By averaging the weights of these two models (LERP), we can effectively balance these qualities, generating high-quality music that is both faithful to the prompt and more creative.

4 RELATED WORK

Classiﬁer-free guidance (CFG). Initially introduced for diffusion models (Ho & Salimans, 2022) and later adapted for autoregressive LLMs (Gafni et al., 2022; Sanchez et al., 2023; Wings, 2022), CFG has found widespread application in various generative domains, including image (Yu et al., 2022; Saharia et al., 2022; Rombach et al., 2022; Nichol et al., 2021; Ramesh et al., 2022), video (Blattmann et al., 2023; Ho et al., 2022), and audio (Kreuk et al., 2023; Copet et al., 2023) generation. However, the detrimental impact of CFG on diversity is well-documented (Dhariwal & Nichol, 2021; Kreuk et al., 2023; Nichol et al., 2021), limiting its application when exploration is key.

Distillation. Knowledge distillation (Hinton et al., 2015) is emerging as a powerful technique to train state-of-the-art models (Gemma Team et al., 2024). By transferring knowledge from a teacher, the student can perform better than with standard training on the same data (Sanh et al., 2019; Lin et al., 2020; Gu et al., 2023). In the context of diffusion models, CFG distillation was applied to drastically reduce the inference time (Meng et al., 2023; Luo et al., 2023; Yin et al., 2024; Saito et al., 2024; Bai et al., 2023; Novack et al., 2024). In our work, we apply the CFG distillation idea to LLMs where we employ a single-stage on-policy distillation procedure (Agarwal et al., 2024) to distill a CFG-augmented LLM, while introducing a novel diversity-promoting RL algorithm and model merging for improved quality-diversity trade-off.

Quality-diversity in LLMs. Zhang et al. (2021) compare the quality-diversity trade-offs of various inference-time strategies for LLMs, including temperature sampling, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020). These methods perform similarly except when quality is prioritized over diversity, where nucleus sampling performs best. Regarding ﬁnetuning strategies, Kirk et al. (2024); Mohammadi (2024); Chaudhari et al. (2024); Li et al. (2024) have shown that RLHF can negatively impact the diversity of generated text, often measured with metrics like BLEU. To the best of our knowledge, we are the ﬁrst to introduce an RL algorithm to optimize diversity, which could potentially also solve those reductions in diversity in RLHF. In contrast, existing qualitydiversity algorithms (Lehman & Stanley, 2011; Mouret & Clune, 2015; Cully et al., 2015; Cideron et al., 2020; Ding et al., 2024) aim at ﬁnding a population of agents with both high-quality and diverse behaviors. Most similarly, Li et al. (2016); Zhang et al. (2018) also tried to increase the diversity across generations produced by a single agent, but measured by the number of distinct n-grams, and optimised with objectives based on mutual information (Shannon, 1948).

Model merging for Pareto-optimality. Model merging via weight averaging (Utans, 1996) has two main applications in deep learning. First, it increases generalisation by reducing variance (Wortsman et al., 2022a; Ram e et al., 2022) and memorisation (Lin et al., 2024; Ram e et al., 2024). Second, it combines their strengths (Ilharco et al., 2023; Dimitriadis et al., 2023; Wang et al., 2024), as employed in Ram e et al. (2023) for multi-objective RL, where policies ﬁnetuned with different rewards are interpolated. Similarly, we interpolate between policies where only one of them is rewarded for diversity. Despite recent theoretical efforts to explain the empirical success of model merging (Ferbach et al., 2024; Ram e et al., 2024), a complete understanding remains elusive.

Published as a conference paper at ICLR 2025

Music generation. Diffusion models (Huang et al., 2023; Schneider et al., 2023; Liu et al., 2023) and Transformers (Agostinelli et al., 2023; Copet et al., 2023) are now the state-of-the-art architectures for music generation. In this work, we leverage the Transformer-based approach from Agostinelli et al. (2023), casting music generation as a categorical prediction task in the discrete token space provided by a neural audio codec (Zeghidour et al., 2022; D efossez et al., 2022). RL-ﬁnetuning for music generation has previously been explored in Jaques et al. (2017); Guimaraes et al. (2017); Kotecha (2018); Latif et al. (2023); Cideron et al. (2024). In contrast, we are the ﬁrst to distill CFG, to enforce diversity through RL, to apply model merging or to improve the quality-diversity trade-off for music.

5 DISCUSSIONS AND LIMITATIONS

Ampliﬁcation and distillation. In this work, we leverage a teacher-student framework where the teacher is an augmented version of the student model itself. This allows to distill the quality improvements obtained from a potent but expensive inference strategy, eliminating the overhead at deployment. This echoes the principles of iterated distillation and ampliﬁcation (IDA) (Cotra, 2018), where the model iteratively learns on data generated by an augmented version of itself. Examples of this can be seen in Alpha Go (Silver et al., 2016), achieving superhuman performance by distilling the knowledge of MCTS (Kocsis & Szepesv ari, 2006), or in recent LLM works (Wang et al., 2023; Yu et al., 2024), distilling System 2 into System 1 by imitating ofﬂine predictions obtained from Chain-of-Thought (Wei et al., 2022). Given the effectiveness of scaling inference-time strategies (Snell et al., 2024), we anticipate wider adoption of such ampliﬁcation-then-distillation techniques.

Extension to other models or modalities. The three main components of our approach distillation of an inference strategy, RL with a diversity reward, and model merging could be readily adapted to other generative architectures. For instance, model merging is already a popular strategy for diffusion models (Purplekeyboard, 2022; Biggs et al., 2024). Additionally, while our focus is on text-to-music, similar strategies could be applied to other setups involving text, image or video generation. For applications where retaining the CFG coefﬁcient is crucial our approach can easily be extended with methods like CLP (Wang et al., 2024) . It would only require to swap the underlying RL algorithm with the CLP algorithm.

Diversity measures. We used negative cosine similarity between embeddings as our diversity measure. Critically, we demonstrate in Appendix D.2 its superior correlation with human perception of diversity compared to token-level entropy: we suspect this is because diversity should be assessed at the sentence level for creative tasks. Yet, qualitative inspection suggests that this diversity measure can still be hacked (Amodei et al., 2016; Gao et al., 2023). For example, models ﬁnetuned to maximise diversity sometimes generate excessive background drums, likely due to a bias of the underlying embedding model E. This could be changed by making E more invariant to background drums; alternatively, inspired by Ding et al. (2024), we could learn directly a human feedback (or AI-feedback) diversity embedding, by asking humans (or LLMs) to rate the similarity of generations. More broadly, relying on a single diversity metric may be insufﬁcient, as diversity can manifest in various ways (Levy et al., 2024); in music, diversity could focus on variations in voice, key, tempo, or other elements; in NLP, a summarization task may prioritise syntactic diversity over semantic diversity. By incorporating multiple diversity rewards, each targeting a speciﬁc variation, and then leveraging model merging, we could achieve ﬁne-grained control over diversity at deployment time.

6 CONCLUSION

In this work, we introduced diversity-rewarded CFG distillation, a novel ﬁnetuning approach to enhance the quality-diversity trade-off in generative models. First, we online distilled CFG, eliminating its computational overhead at inference time. Second, to preserve and even enhance diversity across generations, we incorporated an RL procedure that optimises a diversity reward based on similarity embeddings. Third, we leveraged model merging to enable dynamic control over the quality-diversity trade-off at deployment time. Through extensive experiments on text-to-music generation, we demonstrated the validity of our strategy, with our ﬁnetuned-then-merged model performing best according to human evaluations. We believe that our work provides a promising foundation for future research exploring tasks where alignment and creativity are key, and that it can be easily extended to setups beyond text-to-music generation.

Published as a conference paper at ICLR 2025

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from selfgenerated mistakes. In ICLR, 2024. (pp. 2, 3, 4, 5, and 9)

Andrea Agostinelli, Timo I Denk, Zal an Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. Musiclm: Generating music from text. ar Xiv preprint, 2023. (pp. 1, 3, 5, 10, 20, 22, and 23)

Arash Ahmadian, Chris Cremer, Matthias Gall e, Marzieh Fadaee, Julia Kreutzer, Ahmet Ust un, and Sara Hooker. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. ar Xiv preprint, 2024. (p. 4)

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man e. Concrete problems in AI safety. ar Xiv preprint, 2016. (pp. 8 and 10)

Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, and Somayeh Sojoudi. Accelerating diffusion-based text-to-audio generation with consistency distillation. ar Xiv, 2023. (pp. 3 and 9)

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Neur IPS, 2015. (p. 4)

Benjamin Biggs, Arjun Seshadri, Yang Zou, Achin Jain, Aditya Golatkar, Yusheng Xie, Alessandro Achille, Ashwin Swaminathan, and Stefano Soatto. Diffusion soup: Model merging for text-toimage diffusion models. ar Xiv preprint, 2024. (p. 10)

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023. (p. 9)

Zal an Borsos, Rapha el Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Shariﬁ, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. Audio LM: A language modeling approach to audio generation. TASLP, 2023. (p. 1)

Gavin Brown, Jeremy L Wyatt, and Peter Tiˇno. Managing diversity in regression ensembles. JMLR, 2005. (p. 1)

Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, and Bruno Castro da Silva. Rlhf deciphered: A critical analysis of reinforcement learning from human feedback for llms. ar Xiv preprint, 2024. (p. 9)

Geoffrey Cideron, Thomas Pierrot, Nicolas Perrin, Karim Beguir, and Olivier Sigaud. Qd-rl: Efﬁcient mixing of quality and diversity in reinforcement learning. ar Xiv preprint, 2020. (p. 9)

Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zal an Borsos, Brian Mc Williams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, L eonard Hussenot, Neil Zeghidour, and Andrea Agostinelli. Music RL: Aligning music generation to human preferences. In ICML, 2024. (pp. 1, 3, 5, 8, and 10)

Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre D efossez. Simple and controllable music generation. In Neur IPS, 2023. (pp. 1, 9, 10, and 23)

Ajeya Cotra. Iterated Distillation and Ampliﬁcation, 2018. URL https://ai-alignment.c om/iterated-distillation-and-amplification-157debfd1616. (p. 10)

Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 2015. (p. 9)

Alexandre D efossez, Laurent Mazar e, Manu Orsini, Am elie Royer, Patrick P erez, Herv e J egou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. (p. 1)

Published as a conference paper at ICLR 2025

Alexandre D efossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High ﬁdelity neural audio compression. TMLR, 2022. (p. 10)

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Neur IPS, 2021. (pp. 1 and 9)

Nikolaos Dimitriadis, Pascal Frossard, and Franc ois Fleuret. Pareto manifold learning: Tackling multiple tasks via ensembles of single-task models. In ICML, 2023. (pp. 5 and 9)

Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback: Towards open-ended diversity-driven optimization. In ICML, 2024. (pp. 9 and 10)

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. (pp. 4 and 20)

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. In ICASSP, 2023. (pp. 22 and 23)

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Iryna Gurevych and Yusuke Miyao (eds.), ACL, 2018. (p. 9)

Damien Ferbach, Baptiste Goujaud, Gauthier Gidel, and Aymeric Dieuleveut. Proving linear mode connectivity of neural networks via optimal transport. In AISTATS, 2024. (p. 9)

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In ICML, 2020. (p. 2)

Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. ar Xiv preprint, 2017. (p. 1)

Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. TMLR, 2022. (p. 21)

Matthieu Futeral, Andrea Agostinelli, Marco Tagliasacchi, Neil Zeghidour, and Eugene Kharitonov. Mad speech: Measures of acoustic diversity of speech. ar Xiv preprint, 2024. (pp. 4 and 20)

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Makea-scene: Scene-based text-to-image generation with human priors. In Shai Avidan, Gabriel J. Brostow, Moustapha Ciss e, Giovanni Maria Farinella, and Tal Hassner (eds.), ECCV, 2022. (p. 9)

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In ICML, 2023. (pp. 8 and 10)

Google Gemini Team. Gemini: A family of highly capable multimodal models. 2023. (p. 1)

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram e, et al. Gemma 2: Improving open language models at a practical size. ar Xiv preprint, 2024. (pp. 4 and 9)

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Knowledge distillation of large language models. ar Xiv preprint, 2023. (p. 9)

Azalea Gui, Hannes Gamper, Sebastian Braun, and Dimitra Emmanouilidou. Adapting frechet audio distance for generative music evaluation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1331 1335. IEEE, 2024. (p. 23)

Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, and Al an Aspuru-Guzik. Objective-reinforced generative adversarial networks (organ) for sequence generation models. ar Xiv preprint, 2017. (p. 10)

Sil Hamilton. Detecting mode collapse in language models via narration. ar Xiv preprint, 2024. (p. 1)

Published as a conference paper at ICLR 2025

Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classiﬁcation. In ICASSP, 2017. (pp. 22 and 23)

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In Neur IPS, 2015. (pp. 2, 3, and 9)

Jonathan Ho and Tim Salimans. Classiﬁer-free diffusion guidance. ar Xiv preprint, 2022. (pp. 1, 3, and 9)

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High deﬁnition video generation with diffusion models. ar Xiv preprint, 2022. (pp. 1 and 9)

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In ICLR, 2020. (p. 9)

Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, and Daniel P. W. Ellis. Mu Lan: A joint embedding of music audio and natural language. In ISMIR, 2022. (p. 5)

Qingqing Huang, Daniel S Park, Tao Wang, Timo I Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, et al. Noise2music: Text-conditioned music generation with diffusion models. ar Xiv preprint, 2023. (p. 10)

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In ICLR, 2023. (pp. 2, 5, and 9)

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jos e Miguel Hern andez-Lobato, Richard E Turner, and Douglas Eck. Sequence tutor: Conservative ﬁne-tuning of sequence generation models with kl-control. In ICML, 2017. (p. 10)

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of RLHF on LLM generalisation and diversity. In ICLR, 2024. (pp. 1, 4, and 9)

Levente Kocsis and Csaba Szepesv ari. Bandit based monte-carlo planning. In ECML, 2006. (pp. 1

Nikhil Kotecha. Bach2bach: generating music using a deep reinforcement learning approach. ar Xiv preprint, 2018. (p. 10)

Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D efossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. Audiogen: Textually guided audio generation. In ICLR, 2023. (pp. 1, 3, and 9)

Siddique Latif, Heriberto Cuay ahuitl, Farrukh Pervez, Fahad Shamshad, Haﬁz Shehbaz Ali, and Erik Cambria. A survey on deep reinforcement learning for audio-based applications. Artiﬁcial Intelligence Review, 2023. (p. 10)

Joel Lehman and Kenneth O Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In GECCO, 2011. (pp. 2 and 9)

Avivit Levy, B. Riva Shalom, and Michal Chalamish. A guide to similarity measures. ar Xiv preprint, 2024. (p. 10)

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In NAACL, 2016. (p. 9)

Margaret Li, Weijia Shi, Artidoro Pagnoni, Peter West, and Ari Holtzman. Predicting vs. acting: A trade-off between world modeling & agent modeling. ar Xiv preprint, 2024. (p. 9)

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. ar Xiv preprint, 2020. (pp. 4 and 9)

Published as a conference paper at ICLR 2025

Yong Lin, Lu Tan, Yifan Hao, Honam Wong, Hanze Dong, Weizhong Zhang, Yujiu Yang, and Tong Zhang. Spurious feature diversiﬁcation improves out-of-distribution generalization. In ICLR, 2024. (p. 9)

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang, and Mark D. Plumbley. Audioldm: Text-to-audio generation with latent diffusion models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), ICML, 2023. (p. 10)

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. ar Xiv, 2023. (pp. 3 and 9)

Ilaria Manco, Benno Weck, Seungheon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, et al. The song describer dataset: a corpus of audio captions for music-and-language evaluation. ar Xiv preprint ar Xiv:2311.10057, 2023. (p. 23)

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In CVPR, 2023. (pp. 1 and 9)

Behnam Mohammadi. Creativity has left the chat: The price of debiasing language models. ar Xiv preprint, 2024. (pp. 1 and 9)

Emad Mostaque. Tweet on the examples of negative prompting, 2023. URL https://x.com/EM ostaque/status/1596905782859436033. (p. 18)

Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. ar Xiv preprint, 2015. (pp. 2 and 9)

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mc Grew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2021. (p. 9)

Zachary Novack, Julian Mc Auley, Taylor Berg-Kirkpatrick, and Nicholas Bryan. Ditto-2: Distilled diffusion inference-time t-optimization for music generation. ar Xiv, 2024. (pp. 3 and 9)

Purplekeyboard. Reddit thread titled What is the point of the endless model merges?, 2022. URL

https://www.reddit.com/r/Stable Diffusion/comments/11kau9d/what i s the point of the endless model merges/. (p. 10)

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical textconditional image generation with clip latents. ar Xiv preprint, 2022. (pp. 1 and 9)

Alexandre Ram e, Matthieu Kirchmeyer, Thibaud Rahier, Alain Rakotomamonjy, Patrick Gallinari, and Matthieu Cord. Diverse weight averaging for out-of-distribution generalization. In Neur IPS, 2022. (pp. 2, 7, and 9)

Alexandre Ram e, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights ﬁne-tuned on diverse rewards. In Neur IPS, 2023. (pp. 3, 5, and 9)

Alexandre Ram e, Nino Vieillard, L eonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. WARM: On the beneﬁts of weight averaged reward models. In ICML, 2024. (p. 9)

Denise Rey and Markus Neuh auser. Wilcoxon-signed-rank test. International encyclopedia of statistical science, pp. 1658 1659, 2011. (pp. 21 and 22)

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. Highresolution image synthesis with latent diffusion models. In CVPR, 2022. (p. 9)

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Neur IPS, 2022. (p. 9)

Published as a conference paper at ICLR 2025

Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, and Yuki Mitsufuji. Soundctm: Uniting score-based and consistency models for text-to-sound generation. ar Xiv, 2024. (pp. 3 and 9)

Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, and Stella Biderman. Stay on topic with classiﬁer-free guidance. ar Xiv preprint, 2023. (p. 9)

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ar Xiv preprint, 2019. (pp. 3 and 9)

Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Sch olkopf. Moˆusai: Text-to-music generation with long-context latent diffusion. ar Xiv preprint, 2023. (p. 10)

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A uniﬁed embedding for face recognition and clustering. In CVPR, 2015. (pp. 4 and 20)

Pier Giuseppe Sessa, Robert Dadashi, L eonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ram e, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, et al. BOND: Aligning llms with best-of-n distillation. ar Xiv preprint, 2024. (p. 4)

Divya Shanmugam, Davis Blalock, Guha Balakrishnan, and John Guttag. Better aggregation in test-time augmentation. In ICCV, 2021. (p. 1)

Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 1948. (p. 9)

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016. (p. 10)

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. ar Xiv preprint, 2024. (p. 10)

Richard S Sutton. Reinforcement learning: An introduction. A Bradford Book, 2018. (p. 17)

Richard S Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS, 1999. (p. 4)

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. ar Xiv preprint, 2022. (p. 5)

Joachim Utans. Weight averaging for neural networks and local resampling schemes. In AAAI, 1996. (pp. 2 and 9)

Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, et al. Conditioned language policy: A general framework for steerable multi-objective ﬁnetuning. In EMNLP, 2024. (pp. 9 and 10)

Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. Scott: Selfconsistent chain-of-thought distillation. In ACL, 2023. (p. 10)

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought prompting elicits reasoning in large language models. In Neur IPS, 2022. (p. 10)

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning, 1992. (pp. 4, 5, and 17)

Rivers Have Wings. Tweet on classiﬁer-free guidance for autoregressive models, 2022. URL

https://x.com/Rivers Have Wings/status/1478093658716966912. (p. 9)

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple ﬁne-tuned models improves accuracy without increasing inference time. In ICML, 2022a. (pp. 2, 7, and 9)

Published as a conference paper at ICLR 2025

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Hanna Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust ﬁne-tuning of zero-shot models. In CVPR, 2022b. (p. 7)

Tianwei Yin, Micha el Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. ar Xiv, 2024. (pp. 3 and 9)

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. TMLR, 2022. (p. 9)

Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. Distilling system 2 into system 1. ar Xiv preprint, 2024. (p. 10)

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec. Trans. Audio Speech Lang. Process., 30, 2022. (p. 10)

Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan. Trading off diversity and quality in natural language generation. In Anya Belz, Shubham Agarwal, Yvette Graham, Ehud Reiter, and Anastasia Shimorina (eds.), ACL, 2021. (pp. 2, 9, and 20)

Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. Generating informative and diverse conversational responses via adversarial information maximization. Neur IPS, 2018. (p. 9)

Published as a conference paper at ICLR 2025

Diversity-Rewarded CFG Distillation

Supplementary material

A POLICY GRADIENT FOR THE DIVERSITY REWARD

In this appendix we consider the general case of optimising diversity across N generations, where:

r D(y1, . . . , y N) =

i,j {1,...,N},i =j r D(yi, yj)

i,j {1,...,N},i =j 1 . (7)

The key originality is that this reward considers multiple trajectories simultaneously, while the traditional policy gradient from Sutton (2018) considers a single trajectory. We make the theoretical connection below.

Let s consider a symmetric multi-trajectories reward, where the expected reward is deﬁned as follows:

N Ex X Ey1,...,y N pθ( |x) [r D(y1, ..., y N)] . (8)

If we expand the inner expectation:

Ey1,...,y N pθ( |x) [r D(y1, ..., y N)] = X

y1,...,y N r D(y1, ..., y N)

i=1 pθ(yi|x). (9)

Then we take the gradient:

Ey1,...,y N pθ( |x) [r D(y1, ..., y N)] = θ

y1,...,y N r D(y1, ..., y N)

i=1 pθ(yi|x)

j=1 pθ(yj|x)r D(y1, ..., y N)

i=1,i =j pθ(yi|x) (11)

i=1 pθ(yi|x)

j=1 log pθ(yj|x)

r D(y1, ..., y N) (12)

y1,...,y N log pθ(yj|x)r D(y1, ..., y N)

i=1 pθ(yi|x) (13)

y1,...,y N log pθ(y1|x)r D(y1, ..., y N)

i=1 pθ(yi|x). (14)

For Equation (12), we used logx = x/x. For Equation (14), we used the symmetry of r D and change of variable. Hence, a non biased estimator of the policy gradient for J(θ) is:

r D(y1, ..., y N) log pθ(y1|x) where x X, y1, ..., y N pθ( |x). (15)

Due to the similarity of Equation (15) with the estimator of policy gradient for single-trajectory rewards, the diversity reward can be optimised with the usual policy gradient algorithms. As an example, the sampled gradient with a batch size B of the vanilla REINFORCE algorithm (Williams, 1992) would become 1 BN PB i=1 PN j=1 r D (yi,1, ..., yi,N) log pθ (yi,j|xi) with xi X, yi,j pθ( |xi).

Published as a conference paper at ICLR 2025

B CFG FOR TEXT-TO-MUSIC GENERATION

Figure 6 shows the impact of CFG when applied at inference time on the Music RL-R base model. Speciﬁcally, we plot the quality-diversity front of solutions when sliding the adherence hyperparameter γ for unconditional and negative prompting. We observe that CFG signiﬁcantly decreases diversity going from 0.5 (for γ = 1) to 0.3 (for γ = 7). Critically, we also see that the front of solutions for negative prompting with Bad audio quality. is above the one revealed by unconditional prompting. This speciﬁc negative prompt, inspired by those used in image generation (Mostaque, 2023), was selected based on preliminary experiments where it outperformed other negative prompt candidates. Finally, for quality the best γ is 3, which we use throughout this work.

3.8 4.0 4.2 4.4 4.6 Quality

negative prompting CFG unconditional CFG

Figure 6: Effect of CFG on quality and diversity as a function of γ. The quality is greatly improved at the expense of diversity. Negative prompting outperforms unconditional prompting.

Figure 7 shows the effect of CFG on the User Preference score and the Mu Lan score. Unconditional CFG improves the User Preference score and the Mu Lan score which indicates that the generations are both of better quality and of better adherence to the text. For negative prompting CFG, the User Preference score gain is doubled compared to unconditional. However, the Mu Lan score decreases which can be explained as the generations are biased towards good acoustic quality. With negative prompting CFG, the model tends to always produce high quality audio even when the prompt does not explicitly mention it.

Published as a conference paper at ICLR 2025

1 2 3 4 5 6 7

User Preference

negative prompting CFG unconditional CFG

1 2 3 4 5 6 7

negative prompting CFG unconditional CFG

Figure 7: Effect of CFG on the User Preference score and the Mu Lan score. Negative prompting CFG improves twice as much the User Preference score. For unconditional CFG, the Mu Lan score which reﬂects the text adherence increases which is an expect effect of CFG. For negative prompting CFG, there is a drop of Mu Lan which can be explained as we biased the generations towards good audio quality which is something that might not be present in the text prompt.

C DETAILS ON THE EVALUATION METRICS

Figure 8 shows the User Preference and the Mu Lan score for the different models. Figure 8 shows that (1) β = 0 matches the performances of CFG on both metrics and (2) increasing the β coefﬁcient primarily impacts the text adherence score while keeping the User Preference score nearly constant. It shows that that our approach improves the diversity by reducing the text adherence without impacting the audio quality.

User Preference

Figure 8: Comparison of the performances of the models on the User Preference score and the Mu Lan score. β = 0 matches the performances of CFG. Increasing the β coefﬁcient primarily impacts the text adherence score while keeping the User Preference score nearly constant. It shows that our approach improves the diversity by reducing the text adherence without impacting the audio quality.

Published as a conference paper at ICLR 2025

LERP(0, 15)

LERP(0, 15)

Human Diversity Elo

LERP(0, 15)

Entropy per token

Figure 9: Comparison between different measures of diversity; the one provided by our embedding model (left), the Elo rankings provided by human evaluation of diversity (middle) and the entropy per token (right). CFG at inference time (yellow) reduces all those measures of diversity. Models trained with our RL diversity reward (β = 15 in green) have increased diversity according to our embedding model and human evaluation, but not in terms of entropy.

D DIVERSITY IN MUSIC

D.1 EMBEDDING MODEL FOR DIVERSITY

Diversity in music. The objective of our diversity reward is to estimate the dissimilarity between two music segments. Diversity, especially in music, is a multifaceted concept encompassing various aspects such as instrumentals, melodies, tempo. In this work we conceptualise diversity through the following question: Are these two audio excerpts derived from the same audio recording or not?

Embedding. Drawing inspiration from Futeral et al. (2024), we train a self supervised contrastive model with positive pairs coming from non-overlapping ﬁxed-length chunks of the same audio segment. The model induces an embedding space where segments originating from the same longer audio are mapped to nearby points, while segments from different recordings are mapped further apart. This is useful because cosine similarity in the embedding space can quantify the similarity between two audio segments.

Details. Following Futeral et al. (2024), our embedding model takes as input a 4-seconds audio clip sampled at 16k Hz. We generate the mel-spectrogram using a window size of 512 samples, a hop length of 256 samples, and 156 frequency bins. This results in 376 time frames, each with 156 dimensions. The model architecture is founded on a compact Vision Transformer (Vi T) (Dosovitskiy et al., 2021), comprising 12 layers. Each layer incorporates 6 attention heads, an embedding dimension of 512, and a feed-forward layer with a dimension of 1024 (totaling 25 million parameters). The ﬁnal embeddings are averaged over time and then projected into a 192-dimensional space. For training, we employ the semi-hard triplet loss (Schroff et al., 2015) with a total batch size of 3840, for 200,000 steps, using 16 k Hz audio excerpts sourced from the same training dataset as Agostinelli et al. (2023). Additionally, we augment the training data with noise, tempo and pitch shifts.

D.2 COMPARISON OF OUR DIVERSITY METRIC WITH HUMAN EVALUATION AND ENTROPY

We compare the correlation between our diversity metric, the Elo score computed from human diversity evaluation in Section 3.5, and the entropy per generated token (which is a commonly used proxy for diversity (Zhang et al., 2021)). Figure 9 highlights that our diversity measure has higher correlation with the human raters notion of diversity than entropy. As an example, β = 15 scores high both in terms of Elo score and our diversity metric, while the entropy per token is as low as CFG (which is a low diversity approach). Yet, our measure of diversity remains imperfect and thus can be hacked, as further discussed in Section 5.

Published as a conference paper at ICLR 2025

D.3 OPTIMISATION OF THE DIVERSITY REWARD ACCORDING TO THE NUMBER OF GENERATIONS.

To compute the diversity reward r D(y1, . . . , y N), one important parameter is N the number of generations. Figure 10 shows the quality (left) and diversity (right) obtained when varying the number of generations N {2, 4, 8} while keeping constant the batch size at 128. Figure 10 shows that N = 2 is the best hyperparameter to optimise the diversity. Hence, we used N = 2 throughout our experiments.

0 1000 2000 3000 4000 5000 # steps

num gen=2 num gen=4 num gen=8

0 1000 2000 3000 4000 5000 # steps

num gen=2 num gen=4 num gen=8

Figure 10: Comparison between different values of N the number of generations per prompt. In this experiment we ran the same experimental setup (β = 15, batch size= 128) with N 2, 4, 8. The results show that while the quality score is roughly the same for every values of N (left), the diversity score is best optimised with N = 2 (right).

D.4 VENDI SCORE VS MEAN SIMILARITY

For the 101 prompts used for the quality human evaluation, we generated 8 music clips and we computed the diversity of these music clips with two measures (1) a simple 1-mean computed on the similarity matrix, and (2) the Vendi score (Friedman & Dieng, 2022) computed on the similarity matrix. Table 1 shows that both measures achieve the same ranking in term of diversity.

Table 1: Comparison of the diversity score computed on 8 generations generated with the prompts from the human evaluation. The diversity score is either computed with a simple 1-mean of the similarity matrix or with the Vendi Score (Friedman & Dieng, 2022) computed on the similarity matrix.

Model Vendi Score 1 - Mean

β = 15 2.91 0.60 base 2.40 0.46 LERP(0, 15) 2.35 0.44 β = 0 1.95 0.32 CFG 1.93 0.32

E DETAILS FOR HUMAN EVALUATION

Table 2 and Table 3 show the details results of the side by side human evaluation for respectively quality and diversity. The p-values for two statistical tests, a paired t-test and the Wilcoxon signed rank test (Rey & Neuh auser, 2011) are displayed in these tables. Our results are statistically signiﬁcant according to both statistical tests.

Published as a conference paper at ICLR 2025

Table 2: Details of the side-by-side quality human evaluation. Two statistical tests were performed on the results: the paired t-test and the Wilcoxon signed-rank test (Rey & Neuh auser, 2011). The p-values for these two tests are displayed in the last two columns.

Model A Model B #WINSA #WINSB #DRAWS paired t-test Wilcoxon test

base CFG 59 168 76 6.5e-14 1.0e-12 base β = 0 62 161 80 6.3e-13 5.0e-12 base β = 15 82 130 91 2.8e-04 3.0e-04 base LERP(0, 15) 49 184 70 1.7e-22 1.4e-19 CFG β = 0 107 104 92 7.4e-01 7.3e-01 CFG β = 15 133 78 92 2.7e-05 3.1e-05 CFG LERP(0, 15) 96 105 102 8.0e-01 8.2e-01 β = 0 β = 15 123 95 85 1.9e-02 1.8e-02 β = 0 LERP(0, 15) 99 109 95 5.1e-01 5.1e-01 β = 15 LERP(0, 15) 74 140 89 9.0e-06 1.4e-05

Table 3: Details of the side-by-side diversity human evaluation. Two statistical tests were performed on the results: a paired t-test and the Wilcoxon signed-rank test (Rey & Neuh auser, 2011). The p-values for these two tests are displayed in the last two columns.

Model A Model B #WINS0 #WINS1 #DRAWS paired t-test Wilcoxon test

base CFG 107 23 20 9.6e-14 1.8e-11 base β = 0 114 21 15 9.4e-20 1.0e-15 base β = 15 56 57 37 6.2e-01 6.4e-01 base LERP(0, 15) 95 28 27 1.5e-08 2.4e-07 CFG β = 0 62 56 32 5.8e-01 5.3e-01 CFG β = 15 30 100 20 2.1e-11 3.6e-10 CFG LERP(0, 15) 53 74 23 2.3e-02 1.6e-02 β = 0 β = 15 22 108 20 2.6e-16 1.0e-13 β = 0 LERP(0, 15) 51 75 24 3.9e-02 5.3e-02 β = 15 LERP(0, 15) 92 37 21 2.6e-07 4.9e-07

F CLAP SCORE METRICS

For the 101 prompts used for the quality human evaluation, we generated music clips and computed the CLAP score (Elizalde et al., 2023). As we generated 8 music clips per prompts, we computed a mean CLAP score and a standard deviation. Table 4 shows the CLAP score for the different models. All models perform equally.

Table 4: Clap scores computed on the 101 prompts used for the quality human evaluation. All models perform equally.

Model Clap Score Beta=0 0.52 0.11 Beta=15 0.52 0.11 LERP(0, 15) 0.52 0.11 Base 0.53 0.11 CFG 0.53 0.11

G FAD SCORE

Table 5 shows the FAD score computed on Music Caps (Agostinelli et al., 2023) with two audio embedding backbones namely CLAP (Elizalde et al., 2023) and VGGish (Hershey et al., 2017) and

Published as a conference paper at ICLR 2025

the FAD score computed on Song Describer (Manco et al., 2023). We see a similar pattern for all the variations. Without diversity (i.e. β = 0) CFG Distillation matches the performance of CFG. Then, as we increase diversity, FAD decreases consistently. However, there is an inconsistency as the base model has a lower FAD than CFG which is contradicted with the quality human evaluation that shows that CFG is better than base. The inconsistency on the FAD score were reported in previous work (Copet et al., 2023; Gui et al., 2024).

Table 5: FAD on Music Caps (Agostinelli et al., 2023) computed with either CLAP (Elizalde et al., 2023) or VGGish (Hershey et al., 2017) and FAD with CLAP on Song Describer (Manco et al., 2023). We see a similar pattern for all the variations. Without diversity (i.e. β = 0) CFG Distillation matches the performance of CFG. Then, as we increase diversity, FAD decreases consistently. However, there is an inconsistency as the base model has a lower FAD than CFG which is contradicted with the quality human evaluation that shows that CFG is better than base.

Dataset Music Caps Music Caps Song Describer

Audio Embedding FAD w/ VGGish FAD w/ CLAP FAD w/ CLAP

base 3.759 0.15 0.11 CFG 4.371 0.21 0.13 β = 0 4.357 0.21 0.13 LERP(0, 15) 4.986 0.23 0.13 β = 15 6.437 0.26 0.15