# Stay on topic with Classifier-Free Guidance

Guillaume V. Sanchez*1,2, Alexander Spangher*3, Honglu Fan*4,2, Elad Levi5, Stella Biderman2

*Equal contribution. 1LightOn, France (work done while working at Hexaglobe); 2EleutherAI; 3Information Sciences Institute, University of Southern California; 4University of Geneva; 5Sightful. Correspondence to: Guillaume V. Sanchez, Honglu Fan, Alexander Spangher. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Classifier-Free Guidance (CFG) has recently emerged as a lightweight technique for encouraging prompt adherence in generations, yet it has not been successfully applied to language modeling. In this work, we demonstrate across a wide array of benchmarks that CFG can be used broadly as an inference-time technique in pure language modeling. We show that CFG (1) improves the performance of Pythia, GPT-2 and LLaMA-family models across a broad set of Q&A, reasoning and code generation tasks, achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B; (2) brings improvements equivalent to a model with twice the parameter count; (3) can stack alongside other inference-time methods like Chain-of-Thought and Self-Consistency, yielding further improvements in difficult tasks; (4) can be used to increase the faithfulness and coherence of assistants in challenging form-driven and content-driven prompts: in human evaluations we show a 75% preference for using CFG over the baseline.

1. Introduction

In recent years, large language models (LLMs) have exhibited strong capabilities on a diverse array of tasks (Devlin et al., 2019b; Brown et al., 2020; Scao et al., 2022). However, they continue to struggle with issues such as hallucination (Manakul et al., 2023), degradation (Holtzman et al., 2019) and meandering (Spangher et al., 2023). Various approaches have been proposed to address this, such as instruction-finetuning (Wei et al., 2021; Sanh et al., 2021) and reinforcement learning (Ouyang et al., 2022a; Askell et al., 2021); however, these techniques require large amounts of data and may not be accessible to all researchers.

Figure 1: An illustration in latent space showing how increasing the guidance weight γ increases the importance of the prompt "Today in France,". The figure's example continuations include "and chickens lay eggs" (γ = 0), "citizens were celebrating Christmas", "citizens were celebrating Thanksgiving", and "citizens were celebrating Bastille Day".

Meanwhile, similar degenerative problems have been observed in text-to-image generation: models can ignore parts of the prompt or introduce extra objects (Nichol et al., 2022). Classifier-Free Guidance (CFG) has emerged as an elegant training-free approach to address this (Ho & Salimans, 2021). In CFG, the generative model itself is used, sans modifications, during inference to encourage the desiderata.

While CFG might be a lightweight solution to prompt misadherence in LLMs, it has not previously been applied in the autoregressive text-generation setting. There are many reasons to hypothesize that CFG might not transfer: in text-to-image generation, prompts are simple descriptions and outputs are fixed-size (Lin et al., 2023). In language modeling, prompts can be highly complex and multipart, and outputs are autoregressive and unbounded. In this paper, we apply CFG to LLMs to increase model alignment to prompts.
We perform modifications to CFG: while text-to-image models (which primarily utilize diffusion models) need to be specifically trained with conditioning dropout (Ho & Salimans, 2021) to utilize CFG, we find that in text generation CFG can work out-of-the-box, at lower γ values (we discuss this further in Section 6). CFG improves alignment on an exhaustive array of benchmarks covering, we show, many widely used prompting approaches: zero-shot prompting, Chain-of-Thought prompting, long-form generative prompting and complex chatbot-style prompting (see Table 1). Not only is our formulation of CFG effective and lightweight, it is remarkably stable to hyperparameter settings and requires no tuning across prompting styles: it is a promising plug-and-play technique. Our work has been directly incorporated into leading open-source libraries: Hugging Face and llama.cpp.

We make the following contributions:

1. We devise a framework for CFG in language modeling and show significant improvements across a range of benchmarks, establishing it as a versatile inference-time technique that can be applied out-of-the-box. We test CFG in many different prompting techniques spanning many LLM use-cases, even achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B (Chowdhery et al., 2022).
2. We show that, for the same computational costs during inference time in terms of FLOPs and VRAM, one can train a model that is half the size, use CFG, and obtain similar performance on those benchmarks;
3. By using negative prompting, we demonstrate that we can achieve more granular control over chatbot-style assistant prompting. In a blind human evaluation, we show a 75% preference for GPT4All using CFG in this setting over vanilla sampling;
4. We provide interpretations of the impact that CFG has on text generation, both (1) qualitatively, by visualizing how CFG upweights words more related to the prompt (our visualization, we note, can be an integral part of effective prompt engineering), and (2) quantitatively, by showing that CFG decreases entropy in the sampling distribution.

2. Background and Related Works

To understand Classifier-Free Guidance (CFG) in LLMs, we must first understand steering and controllability in generative models. In this section, we first discuss the origins of CFG in text-to-image generation, and then discuss how autoregressive language modeling differs.

2.1. Classifier Guidance in Text-to-Image Models

Suppose P(x) is an unconditional model for image x and P(x|c) is a conditioned model with conditioning c (e.g. a label or text prompt). Generative models usually generate x by decoding from an abstract semantic space, z. In Classifier Guidance (Dhariwal & Nichol, 2021), an auxiliary classifier Pϕ(c|x) guides sampling to increase the likelihood of c in x. This modification results in the following:

\hat{P}(x|c) \propto P_\theta(x)\, P_\phi(c|x)^{\gamma}    (1)

where γ is called the guidance strength. As Equation 1 shows, guidance is a reweighting of Pθ according to the classifier likelihood Pϕ. γ = 0 reduces Equation 1 to the unconditional model P(x), while γ = 1 reduces Equation 1 to conditional generation P(x|c). When γ > 1, \hat{P} overemphasizes the conditioning, which results in a better Inception Score (albeit at the cost of diversity (Dhariwal & Nichol, 2021)). This approach has been successfully used in a variety of works (Gal et al., 2021; Kim et al., 2022; Crowson et al., 2022b).

Classifier-Free Guidance (Ho & Salimans, 2021) builds on the observation that, by using Bayes' rule, we can eliminate the external classifier.
By training the same model Pθ to support both conditional and unconditional generation (via conditioning dropout), we can rewrite the second term in Equation 1 as P_\theta(c|x) \propto \frac{P_\theta(x|c)}{P_\theta(x)}. Sampling is performed according to:

\hat{P}_\theta(x|c) \propto \frac{P_\theta(x|c)^{\gamma}}{P_\theta(x)^{\gamma-1}}.    (2)

Modeling \hat{P}_\theta(x|c) with a diffusion process (Ho et al., 2020) reduces to predicting the PDF of the sample noise ϵt:

\log \hat{P}_\theta(\epsilon_t|x_{t+1}, c) = \gamma \log P_\theta(\epsilon_t|x_{t+1}, c) - (\gamma - 1) \log P_\theta(\epsilon_t|x_{t+1}).    (3)

We can rewrite Equation 3 as:

\log \hat{P}_\theta(\epsilon_t|x_{t+1}, c) = \log P_\theta(\epsilon_t|x_{t+1}) + \gamma \left[ \log P_\theta(\epsilon_t|x_{t+1}, c) - \log P_\theta(\epsilon_t|x_{t+1}) \right]    (4)

Aside from its probabilistic interpretation, this equation can be seen as a vector operation in latent space: we take a step of size γ away from the unconditional vector in the direction of the conditioning. This brings us to an important tool: Negative Prompting (Du et al., 2020; Documentation; Crowson et al., 2022a; Rombach et al., 2021b). Negative prompting has proven effective in many situations: striking examples have been generated by interpolations in latent space (Ling et al., 2021; Song et al., 2020; Brock et al., 2016). Moreover, the initial point does not have to be the unconditional latent, but can be any representation we want to move away from. We introduce the "negative conditioning" or "negative prompt" \bar{c}, as well as a generalized equation that reduces to Equation 3 when \bar{c} = \emptyset:

\log \hat{P}_\theta(\epsilon_t|x_{t+1}, c, \bar{c}) = \log P_\theta(\epsilon_t|x_{t+1}, \bar{c}) + \gamma \left[ \log P_\theta(\epsilon_t|x_{t+1}, c) - \log P_\theta(\epsilon_t|x_{t+1}, \bar{c}) \right]    (5)

Table 1: Demonstration of CFG-guided generation for an assistant-style prompt (using GPT4All with γ = 5). The assistant has a system-level prompt ("Instruction") that is potentially out-of-distribution (e.g. "write an enthusiastic response") and a user-level prompt ("Prompt"). In vanilla sampling, the model ignores the system-level directive, but with CFG, the model adheres to both the system-level and the user-level prompt.

Instruction: Respond enthusiastically to the following user prompt.
Prompt: What was the Cambridge Analytica scandal?

Vanilla Sampling: "The Cambridge Analytica scandal was a huge scandal in which it was revealed that Cambridge Analytica, a political consulting firm, had used personal data from Facebook to target and influence the 2016 US presidential election. This scandal raised questions about the role of social media in political campaigns..."

CFG-based Sampling: "Oh my goodness! What a scandal! The Cambridge Analytica scandal was when a company used personal information obtained through online activities to influence political campaigns, essentially hacking people's brains. It was a serious breach of trust and privacy, and rightfully so! It is a wake-up call for..."

2.2. Classifier-Free Guidance of Language Models

Unlike in image generation, where x has fixed dimensionality and all dimensions are generated dependently, in language modeling x is autoregressive and unbounded. Here, we apply CFG to the logits of next-token predictions. Logits, as linear transformations of word embeddings (Mikolov et al., 2013; Devlin et al., 2019a), capture semantic meaning. Using the logits also avoids network editing (Belrose et al., 2023) and is architecture-agnostic. In modern LLMs, the conditioning c is typically a prompt (Brown et al., 2020), which can be a context, an instruction, or the beginning of some text. We wish to generate a text w which has a high likelihood of starting with c. We define the γ-reweighted distribution \hat{P}(w|c) \propto P(w)\, P(c|w)^{\gamma}, and approximate it with CFG as

\hat{P}(w|c) \propto \frac{P(w|c)^{\gamma}}{P(w)^{\gamma-1}}.

In the case of autoregressive language models, P_\theta(w) = \prod_{i}^{T} P_\theta(w_i \mid w_{<i}), and the CFG reweighting is applied to the logits of each next-token prediction.
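To make this next-token formulation concrete, the following is a minimal sketch of logit-space CFG for a causal language model, written against the Hugging Face transformers API. The model name, the choice of a lone EOS token as the unconditional context, and the plain sampling loop are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def cfg_generate(prompt: str, gamma: float = 1.5, max_new_tokens: int = 50) -> str:
    """Sample a continuation with classifier-free guidance on the logits.

    Two forward passes per step: one conditioned on the full prompt,
    one on an 'unconditional' context (here, the continuation alone,
    seeded with the EOS token). Logits are blended as in Equation 4:
        l_cfg = l_uncond + gamma * (l_cond - l_uncond).
    """
    cond_ids = tokenizer(prompt, return_tensors="pt").input_ids
    uncond_ids = torch.tensor([[tokenizer.eos_token_id]])
    generated = []

    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits_cond = model(cond_ids).logits[:, -1, :]
            logits_uncond = model(uncond_ids).logits[:, -1, :]
        logits_cfg = logits_uncond + gamma * (logits_cond - logits_uncond)
        probs = torch.softmax(logits_cfg, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)

        # Append the sampled token to both branches before the next step.
        cond_ids = torch.cat([cond_ids, next_token], dim=-1)
        uncond_ids = torch.cat([uncond_ids, next_token], dim=-1)
        generated.append(next_token.item())

    return tokenizer.decode(generated)

print(cfg_generate("Today in France, ", gamma=1.5))
```

Recent versions of the transformers library also expose a guidance_scale argument to generate() for this purpose; the loop above simply makes the two forward passes and the Equation 4-style blend explicit.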
For γ > 1.5, the quality of reasoning chains degrades and overall performance drops (a qualitative comparison is provided in Tables 15 and 16). We anticipate in future work being able to more fully test variations of CFG-weighting on different parts of the CoT process and CFG's interactions with CoT; for instance, instead of upweighting just w_p, we might upweight w_p, w_cot, or other variations.

3.3. Long Prompts: Generation

In contrast to basic prompting and CoT-prompting (Sections 3.1 and 3.2), where we primarily expect short answers, here we study tasks where prompts and continuations are both potentially long sequences of text. We focus on code generation here. In this setting, the quality of answers is highly dependent on the model's ability to stay on target. We hypothesize that, in this setting, CFG can effectively enforce adherence to the full prompt.

3.3.1. Program Synthesis Evaluations

Program synthesis presents us with a scenario where adherence to the full prompt is essential to performance. Additionally, testing CFG on code-related tasks also demonstrates CFG's impact on formal language. Here, we prompt GPT-J (Wang & Komatsuzaki, 2021) and CodeGen-350M-mono (Nijkamp et al., 2023) for code generations and observe positive results (see Appendix D.1), such as an 18% improvement in the accuracy rate for GPT-J and a 37% improvement in the syntax correctness rate for CodeGen-350M-mono with positive guidance.

Figure 3: HumanEval task count comparison between γ = 1 and γ = 1.25 for CodeGen-350M-mono.

Next, we evaluate CFG on the HumanEval benchmark (Chen et al., 2021). The HumanEval benchmark contains 164 coding tasks in Python, with English prompts given by a function signature and a docstring. The model generates code-based continuations of the prompt, which are tested against unit tests to evaluate the correctness of programs. We choose CodeGen-350M-mono, CodeGen-2B-mono and CodeGen-6B-mono (Nijkamp et al., 2023), which are designed for Python program synthesis (CodeGen-16B-mono is omitted due to compute constraints). We test different CFG strengths (γ = 1.0, 1.1, 1.25, 1.5, 1.75, 2.0) and different temperatures, evaluating at pass@k for k = 1, 10, 100; following the definition of (Chen et al., 2021), "k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported." We show the results for temperature = 0.2 in Table 2 (full HumanEval results are shown in Appendix C.4, in Tables 7, 8 and 9 and Figures 11, 12 and 13).
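In practice, pass@k under this definition is usually computed with the unbiased estimator of Chen et al. (2021) from n samples per problem, c of which pass the tests. The snippet below is a small sketch of that standard estimator, not necessarily the evaluation code used for the experiments reported here.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for the problem
    c: number of samples that pass the unit tests
    k: sample budget being evaluated
    """
    if n - c < k:
        return 1.0  # any size-k subset must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 23 of them pass
print(pass_at_k(n=200, c=23, k=1))   # 0.115
print(pass_at_k(n=200, c=23, k=10))  # higher, since any of 10 draws may pass
```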
The pass@1 rate, we find, increases with CFG across 1 ≤ γ ≤ 1.5 and degrades thereafter, in accordance with the findings in Section 3.2. As shown in Table 3, the number of tasks where CFG outperforms is greater than the number where CFG underperforms at pass@1 for γ = 1, 1.25 with CodeGen-350M-mono (see the scatter plots at temperatures 0.2, 0.6 and 0.8 in the appendix, Figure 11). We note that the improvement from CFG diminishes or harms performance at high k. Without CFG, many tasks exhibit small nonzero passing rates, while having a 0% rate with CFG. This indicates that larger k significantly boosts the passing rate of difficult tasks where the rates are low but nonzero. Overall, the consistent improvement in pass@1 rates and the reduced effect on pass@100 rates support our hypothesis that CFG strengthens adherence to the prompt at the small cost of reduced variability and creativity.

Table 2: CodeGen results with temperature = 0.2. CFG in nearly all cases increases performance, but the optimal γ value varies.

| γ    | 350M k=1 | 350M k=10 | 350M k=100 | 2B k=1 | 2B k=10 | 2B k=100 | 6B k=1 | 6B k=10 | 6B k=100 |
|------|----------|-----------|------------|--------|---------|----------|--------|---------|----------|
| 1.0  | 11.0%    | 17.0%     | 22.0%      | 19.5%  | 25.5%   | 29.8%    | 19.5%  | 25.5%   | 29.8%    |
| 1.1  | 11.8%    | 18.1%     | 20.1%      | 20.4%  | 25.4%   | 28.0%    | 20.4%  | 25.4%   | 28.0%    |
| 1.25 | 11.4%    | 17.3%     | 18.9%      | 19.7%  | 25.4%   | 28.0%    | 19.7%  | 25.4%   | 28.0%    |
| 1.5  | 10.9%    | 16.7%     | 18.3%      | 20.9%  | 26.7%   | 29.2%    | 20.9%  | 26.7%   | 29.2%    |
| 1.75 | 10.3%    | 16.0%     | 18.2%      | 20.4%  | 26.2%   | 28.6%    | 20.4%  | 26.2%   | 28.6%    |
| 2.0  | 8.6%     | 14.6%     | 17.6%      | 16.5%  | 22.4%   | 24.4%    | 16.5%  | 22.4%   | 24.4%    |

3.4. Negative Prompting: Improving Assistants

Finally, we explore negative prompting in CFG, as discussed in Equation 5. With negative prompting, the user specifies what they do not want in the output (e.g. "low resolution" in text-to-image), which is then used to better meet user needs. We explore this idea in the context of chatbots. Chatbots give us a setting where the prompt is expanded into a multi-stage prompt (this extension to basic prompting mirrors CoT-prompting's extension (Section 3.2): in CoT-prompting, the continuation is expanded into a multi-stage completion; here, the prompt is expanded). In chatbots, the language model is prompted with a two-part prompt: (1) the instruction, or "system prompt", which may give contextual information or behavioral guidelines (e.g. style, alignment, persona, etc.); and (2) the user-prompt, or the user's query. See Table 1 for an example. Adherence becomes an even greater concern: systems like Alpaca (Taori et al., 2023) often ignore changes to their system-prompt, and may even expose models to attacks like prompt injection (Greshake et al., 2023).

We explore CFG with negative prompting to increase the success of different system prompts. We set the negative prompt \bar{c} (see Equation 5) to be the default system-prompt for our models (i.e. "The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.") and set c to be the edited prompt (e.g. "The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write a sad response."). To test this approach with chatbots, we generate n_c = 25 system-prompts and n_p = 46 user-prompts, and sample 1740 random combinations of them. In Appendix G we include the full list of c and p we use. For each (system-prompt, user-prompt) pair, we use GPT4All-J v1.3-jazzy to generate two completions: one without CFG and one with CFG, with a guidance strength chosen at random from {1, 2, 3, 4, 5, 6}. Our hypothesis is that CFG increases system-prompt following, ideally without hurting user-prompt adherence. We run a human preference study on our sampled continuations, where participants are shown both outputs, blindly, and asked to assess two things: A. which output better follows the system-prompt c, and B. which output better follows the user-prompt p. Our results in Figure 4 show evidence that CFG emphasizes the difference between c and \bar{c} more than sampling with c alone: there is a peak at γ = 3, with a 75% preference for system-prompt following over γ = 1 and undegraded user-prompt relevance (52%).

Figure 4: Evaluators (611 votes, 71 voters) noted that system-prompt adherence is optimal at γ = 3 while user-prompt adherence stays constant.
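To illustrate how Equation 5 is used in this chatbot setting, the sketch below contrasts an edited system prompt c against the default system prompt acting as the negative prompt \bar{c}, blending their next-token logits with strength γ. The model, prompt template, and greedy decoding are illustrative stand-ins rather than the exact GPT4All-J setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the experiments above use GPT4All-J v1.3-jazzy
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def negative_prompt_step(pos_ids, neg_ids, gamma=3.0):
    """One CFG step following Eq. 5: step away from the negative-prompt
    logits toward the positive-prompt logits with strength gamma."""
    with torch.no_grad():
        l_pos = model(pos_ids).logits[:, -1, :]
        l_neg = model(neg_ids).logits[:, -1, :]
    return l_neg + gamma * (l_pos - l_neg)

system_default = ("The prompt below is a question to answer, a task to complete, or a "
                  "conversation to respond to; decide which and write an appropriate response.")
system_edited = ("The prompt below is a question to answer, a task to complete, or a "
                 "conversation to respond to; decide which and write a sad response.")
user_prompt = "What was the Cambridge Analytica scandal?"

pos_ids = tokenizer(f"{system_edited}\n\n{user_prompt}\n", return_tensors="pt").input_ids
neg_ids = tokenizer(f"{system_default}\n\n{user_prompt}\n", return_tensors="pt").input_ids

for _ in range(60):  # decode 60 tokens greedily, for brevity
    logits = negative_prompt_step(pos_ids, neg_ids, gamma=3.0)
    next_token = logits.argmax(dim=-1, keepdim=True)
    pos_ids = torch.cat([pos_ids, next_token], dim=-1)
    neg_ids = torch.cat([neg_ids, next_token], dim=-1)

print(tokenizer.decode(pos_ids[0]))
```

Setting gamma to 1 recovers ordinary sampling from the edited prompt, while larger values push the completion further from the default-system-prompt behavior.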
4. Cost Analysis: FLOPs and VRAM

In the previous section we showed improvements across a wide array of benchmarks and contexts. However, CFG imposes computational and memory requirements that vanilla inference does not. In this section, we explore these requirements, which are of special interest to users with compute and memory constraints.

Compute constraints: In terms of computational requirements, CFG requires two passes through the network, effectively doubling the amount of FLOPs required for inference. Users who are compute-constrained might wonder whether CFG is interesting to them at all, and whether they should instead run a model twice as big. To answer this question, we calculate the FLOPs for each of the benchmark experiments that we ran in Section 3.1. We then compare across model sizes, with and without CFG. We conclude with the surprising finding that, across 5 out of 9 tasks, there is a statistically insignificant difference between using CFG and using vanilla prompting with a model of twice the size at p = .01, according to an ANCOVA regression analysis (Rutherford, 2011). Of the significantly different tasks, 2 favor CFG and 2 favor vanilla. See Appendix C.2, specifically Figure 9, for more details. In other words, this indicates that, overall, a model using CFG can generally perform just as well as a model twice as large.

Memory constraints: The impact of CFG on VRAM is nuanced. While CFG boosts the performance of smaller models, it doubles the demands on the KV cache. We conduct a memory analysis, the results of which we show in Appendix C.3, to explore the conditions under which CFG trumps using a larger vanilla model. We find that the choice between CFG and a larger model is highly dependent on the sequence length the user wishes to generate. The doubling of the KV cache has important implications that qualify CFG's use, and we hope to explore these further, including memory-reduction strategies, in future work.

Figure 5: We seek to identify when CFG is similar to instruction-tuning. Models mostly agree on the difficulty of input sentences, and in cases where they do not, CFG and instruction-tuning have similar top-p overlaps. Panel (a) shows the correlation between the perplexities of CFG vs. instruction tuning on the P3 dataset: corr(PPL_cfg, PPL_p(y|x)) = 0.94, corr(PPL_instruct, PPL_p(y|x)) = 0.83, corr(PPL_instruct, PPL_cfg) = 0.7.

5. Explaining the Success of Classifier-Free Guidance

In this section, we seek to explain the impact of Classifier-Free Guidance on generation. For these tests, we use the Falcon-7b-Base model (Almazrouei et al., 2023) and, when applicable, compare against the Falcon-7b-Instruct version. We run these models on a sample dataset of 32,902 data-
We hypothesize that CFG has similar effects to instruction-tuning, which also encourages a model to focus on the prompt (Webson & Pavlick, 2021). Although CFG and Instruction-Tuned model variants have similar entropy across generation samples, the vocabulary distributions across our samples are largely not overlapping, indicating that CFG is not having a similar effect as instruction-tuning (see Appendix E). There are cases where the two are similar. As shown in Table 5, harder phrases for Instruction-Tuned models are typically where CFG and Instruction-Tuned models align: we observe significant spearman correlations of rs > .7 between Instruction-Tuned models and CFG. As we explore more in the appendix, these correlations are particularly pronounced for longer prompts. We conclude that CFG is altering the model in ways that might complement instruction-tuning, opening the door to future explorations. 5.3. Visualizing Classifier-Free Guidance Finally, we provide qualitative insights into the reordering of the vocabulary induced by CFG. We visualize the vocabulary at each timestep ranked by the difference log P(wt|w" ... gee thence pheus hither favourable . dragons dragon Dragons Dragon DRAGON ... 1944 1942 Instrument Balt 1943 Then dragons dragon dragon Dragons Dragon ... Manz Hopkins CVE Instrument Squadron it dragon dragons neigh Winged Draco ... CVE udder services corrections obbies flew upro ukong rum walked . .. " ... INC inary lein auxiliary CVE over Chinatown Financial Spider tallest Financial ... warr quick Ship Newfoundland Table 3: Given the prompt The dragon flew over Paris, France we display, at each sampling step, the vocabulary ranked for P(wt|w 1. This belief is supported by experiments, such as those conducted with Imagen (Saharia et al., 2022), which show that the prompt correlates more with the image as γ increases. B.2. Generative Guidance in NLP Co-temporaneously with the earliest advances in neural language modeling (Gers et al., 2000) came the recognition that the outputs of these models had to be guided in order to be coherent (Barzilay & Lapata, 2008) and focused (Holtzman et al., 2019). And when larger, higher-performing models like GPT (Radford et al., 2019; Brown et al., 2020) began to show real-world use-cases, the recognition emerged of the need to control their output (Solaiman et al., 2019) to guard against toxic content (Gehman et al., 2020) and bias (Felkner et al., 2022). A central thrust in recent NLP research been to address the above concerns, and approaches have been targeted at nearly every step of training and querying models, from dataset curation (Almazrouei et al., 2023) and training (Keskar et al., 2019), to response-alignment (Ouyang et al., 2022b) and prompt-identification (Gehman et al., 2020). Our work aligns with efforts to control the output of language models by controlling the model s outputted vocabu- lary distribution p(xn|x ˆS, though, the token cost becomes the dominant factor, and a vanilla model without CFG is the more prudent choice. To determine this S, we define the following memory cost functions for M-CFG and M as follows: cost M-CFG(S) = P + 2CS (8) cost M (S) = 2P + C S (9) Setting cost M-CFG(S) = cost M (S) and solving for S, we find the sequence length where the memory requirements of M and M are equivalent: ˆS = P 2C C (10) Equations 8, 9, and 10 show us that, if the user wishes to support S > ˆS, and the GPU s VRAM can accommodate more than cost M ( ˆS), then deploying M is advantageous over M-CFG. 
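As a quick numerical illustration of Equations 8–10, the helper below computes the break-even length Ŝ = P / (2C − C′) along with the two memory costs. The numbers plugged in are invented for illustration; only the cost formulas come from the text above.

```python
def cost_cfg(P: float, C: float, S: int) -> float:
    """Memory of the CFG model M-CFG: parameters P plus a doubled KV cache (Eq. 8)."""
    return P + 2 * C * S

def cost_double(P: float, C_prime: float, S: int) -> float:
    """Memory of a vanilla model of twice the size: 2P plus its KV cache (Eq. 9)."""
    return 2 * P + C_prime * S

def breakeven_length(P: float, C: float, C_prime: float) -> float:
    """Sequence length S_hat at which the two costs are equal (Eq. 10).

    Only meaningful when 2C > C_prime, i.e. the doubled cache of the small
    model grows faster than the cache of the larger model.
    """
    return P / (2 * C - C_prime)

# Hypothetical numbers (GiB of parameters in fp16, GiB of KV cache per token):
P, C, C_prime = 14.0, 0.0005, 0.0008
S_hat = breakeven_length(P, C, C_prime)
print(f"break-even sequence length: {S_hat:.0f} tokens")
print(cost_cfg(P, C, int(S_hat)), cost_double(P, C_prime, int(S_hat)))  # ~equal
```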
This analysis is encapsulated in Figure 10. It must be noted that access to a model of double the size may not be feasible or could be cost-prohibitive to train. In such cases, where the desired performance improvement cannot be achieved by doubling the model size, M-CFG stands as the sole viable alternative. C.4. Human Eval benchmark In Section 3.3.1, we explain our experiments on Code Gen350M-mono, Code Gen-2B-mono and Code Gen-6B-mono and show their performances in the Human Eval benchmark with various CFG for temperature 0.2 in Table 2. The full results for temperature = 0.2, 0.6, 0.8 are shown below in Table 7, 8 and 9). We also put the pass@k-to-γ curves of different temperatures together to show how the temperatures affect the impact of CFG when the model size and k are fixed in Figure 11, 12 and 13. In addition, we breakdown the result of Code Gen-350Mmono on Human Eval benchmark into individual tasks. We plot the accuracy with cfg" vs. accuracy without cfg" charts to visualize the outperform/underperform distributions among all tasks. The plots are shown in Figure 11. C.5. Deliberative Prompting: Chain-of-Thought In this subsection we provide additional results for 3.2. In Figure 17 we provide results on AQu A dataset and in Tables 16 and 15 we provide a qualitative comparison of Co T with and without CFG. These results support our finding that using CFG increases the percentage of Co T which results in a valid answer and boost the model performances. Stay on topic with Classifier-Free Guidance Figure 10: Decision-making function to determine how to enhance a given model s performance in inference, considering the model s size and the available VRAM on the GPU. Half-precision is assumed in calculations. Figure 11: Code Gen-350M-mono performance on Human Eval with various CFG strengths Stay on topic with Classifier-Free Guidance ARC-c ARC-e Bool Q Hella Swag Baseline Ours Baseline Ours Baseline Ours Baseline Ours G-s 22.7 23.0 39.5 42.1 48.7 57.0 31.1 31.9 G-m 25.0 23.9 43.6 47.6 58.6 60.1 39.4 40.9 G-l 25.1 24.7 46.6 51.0 60.5 62.1 45.3 47.1 G-xl 28.5 30.0 51.1 56.5 61.8 62.6 50.9 52.4 P-160M 23.5 23.0 39.5 42.2 55.0 58.3 30.1 31.2 P-410M 24.1 23.8 45.7 50.3 60.6 61.2 40.6 41.6 P-1B 27.0 28.0 49.0 54.9 60.7 61.8 47.1 48.9 P-1.4B 28.6 29.6 53.8 59.6 63.0 63.8 52.1 54.3 P-2.8B 33.1 34.5 58.8 65.4 64.7 64.7 59.3 61.9 P-6.9B 35.2 36.1 61.3 67.4 63.7 64.6 64.0 66.5 P-12B 36.9 38.7 64.1 72.6 67.6 67.8 67.3 69.6 L-7B 41.5 43.9 52.5 58.9 73.1 71.8 73.0 76.9 L-13B 47.8 54.2 74.8 79.1 78.0 75.8 79.1 82.1 L-30B 52.9 57.4 78.9 83.2 82.7 80.0 82.6 85.3 L-65B 55.6 59.0 79.7 84.2 84.8 83.0 84.1 86.3 Pi QA Sci Q Trivia QA Wino Grande LAMBADA Base Ours Base Ours Base Ours Base Ours Base Ours G-s 62.5 63.8 64.4 70.8 5.5 6.5 51.6 50.5 32.6 44.6 G-m 66.4 66.9 67.2 76.7 8.3 9.3 53.1 52.1 43.0 55.8 G-l 69.2 70.2 69.4 78.8 11.1 12.0 55.4 54.4 47.7 60.5 G-xl 70.5 71.3 76.1 82.4 14.7 15.2 58.3 55.6 51.2 62.5 P-160M 61.4 62.1 67.0 75.4 4.1 5.3 52.3 51.1 32.8 47.4 P-410M 67.1 67.8 72.1 79.0 7.9 9.1 52.9 50.7 51.3 64.0 P-1B 69.2 70.5 76.0 82.9 12.3 12.3 53.9 51.5 56.2 69.0 P-1.4B 71.1 72.5 79.4 85.1 15.9 15.9 57.4 56.0 61.6 72.7 P-2.8B 73.6 75.8 83.3 88.2 22.1 20.9 60.1 57.9 64.6 76.5 P-6.9B 76.3 77.4 84.3 89.7 28.2 27.2 61.1 60.3 67.1 78.8 P-12B 77.0 78.4 87.7 91.9 33.4 32.1 65.0 63.4 70.4 80.6 L-7B 77.4 79.8 66.3 75.4 56.0 52.7 67.1 65.5 73.6 81.3 L-13B 80.1 80.9 91.1 95.1 62.4 59.8 72.8 71.5 76.2 82.2 L-30B 82.3 82.3 94.3 96.4 69.7 67.9 75.8 74.1 77.5 83.9 L-65B 82.3 82.6 95.1 96.6 73.3 71.8 77.4 76.1 
79.1 84.0 Table 5: Results of general natural language benchmarks. G stands for GPT2, P for Pythia and L for LLa Ma. In each cell, the first value is the result for γ = 1 (baseline) and the second value is the result for γ = 1.5 (ours). LLa MA 7B with CFG on Lambada zero-shot already outperforms vanilla Pa LM 540B, Chinchilla 70B, and GPT-3 175B, tops the SOTA leaderboard for Lambada zero-shot as of June 26th, 2023 Stay on topic with Classifier-Free Guidance Figure 12: Code Gen-2B-mono performance on Human Eval with various CFG strengths Figure 13: Code Gen-6B-mono performance on Human Eval with various CFG strengths Stay on topic with Classifier-Free Guidance p-value Win Lambada 0.000 CFG Wino Grande 0.003 Vanilla Sci Q 0.008 CFG Trivia QA 0.008 Vanilla Hella Swag 0.012 p > .01 Pi QA 0.030 p > .01 ARC-c 0.216 p > .01 Bool Q 0.345 p > .01 ARC-e 0.355 p > .01 Table 6: ANCOVA p-value results for plots shown in Figure 9. We calculate ANCOVA on log-transformed variables and calculate significance at p = .01. Figure 14: Code Gen-350M-mono Human Eval task-by-task plot with temp=0.8 Blue: CFG outperforms, Purple: CFG ties with the baseline, Red: CFG underperforms Figure 15: Code Gen-350M-mono Human Eval task-by-task plot with temp=0.6 Blue: CFG outperforms, Purple: CFG ties with the baseline, Red: CFG underperforms Figure 16: Code Gen-350M-mono Human Eval task-by-task plot with temp=0.2 Blue: CFG outperforms, Purple: CFG ties with the baseline, Red: CFG underperforms Stay on topic with Classifier-Free Guidance temperature = 0.2 temperature = 0.6 temperature = 0.8 γ k=1 k=10 k=100 k=1 k=10 k=100 k=1 k=10 k=100 1.0 11.0% 17.0% 22.0% 8.9% 18.2% 23.7% 7.2% 17.2% 29.4% 1.1 11.8% 18.1% 20.1% 10.0% 19.7% 25.5% 7.8% 17.1% 22.5% 1.25 11.4% 17.3% 18.9% 9.7% 18.4% 23.7% 8.3% 18.2% 24.9% 1.5 10.9% 16.7% 18.3% 9.9% 19.3% 24.9% 8.0% 18.0% 26.1% 1.75 10.3% 16.0% 18.2% 9.2% 18.3% 23.7% 7.7% 16.9% 24.2% 2.0 8.6% 14.6% 17.6% 7.6% 16.6% 20.1% 7.4% 16.5% 21.3% Table 7: Code Gen-350M-mono results temperature = 0.2 temperature = 0.6 temperature = 0.8 γ k=1 k=10 k=100 k=1 k=10 k=100 k=1 k=10 k=100 1.0 19.5% 25.5% 29.8% 15.9% 29.3% 36.5% 12.3% 26.4% 33.5% 1.1 20.4% 25.4% 28.0% 16.3% 29.3% 36.5% 13.8% 29.0% 38.3% 1.25 19.7% 25.4% 28.0% 17.4% 30.1% 38.3% 14.1% 28.7% 37.6% 1.5 20.9% 26.7% 29.2% 18.3% 31.7% 40.1% 14.9% 29.1% 36.5% 1.75 20.4% 26.2% 28.6% 17.7% 30.4% 35.9% 14.3% 28.3% 34.1% 2.0 16.5% 22.4% 24.4% 13.7% 25.2% 32.2% 11.3% 23.9% 31.6% Table 8: Code Gen-2B-mono results temperature = 0.2 temperature = 0.6 temperature = 0.8 γ k=1 k=10 k=100 k=1 k=10 k=100 k=1 k=10 k=100 1.0 19.5% 25.5% 29.8% 15.9% 29.3% 36.5% 12.3% 26.4% 33.5% 1.1 20.4% 25.4% 28.0% 16.3% 29.3% 36.5% 13.8% 29.0% 38.3% 1.25 19.7% 25.4% 28.0% 17.4% 30.1% 38.3% 14.1% 28.7% 37.6% 1.5 20.9% 26.7% 29.2% 18.3% 31.7% 40.1% 14.9% 29.1% 36.5% 1.75 20.4% 26.2% 28.6% 17.7% 30.4% 35.9% 14.3% 28.3% 34.1% 2.0 16.5% 22.4% 24.4% 13.7% 25.2% 32.2% 11.3% 23.9% 31.6% Table 9: Code Gen-6B-mono results Stay on topic with Classifier-Free Guidance Figure 17: CFG impact on chain-of-thought prompting with respect to AQu A dataset. For small CFG values, using CFG increases the percentage of chains which end in a valid answer structure while increasing the model accuracy. For large values the invalid percentage remains small but the accuracy drop. Stay on topic with Classifier-Free Guidance D. Additional experiments D.1. 
Prompting experiments for code generations We summarize two exploratory experiments which are briefly mentioned in 3.3.1 and precedes our systematic evaluations on Human Eval. 1. The first experiment is to prompt GPT-J (Wang & Komatsuzaki, 2021)12 for code completions of certain languages, and analyze the consistencies between the prompt languages and the completion languages. 2. The second experiment is to prompt Code Gen-350Mmono (Nijkamp et al., 2023) to complete a specific image generation function, and analyze multiple aspects of the completions (syntax, the return type, the return shape and the return quality). Prompting GPT-J for different coding language is inspired by one of the experiments in (Biderman & Raff, 2022). Their observation is that the model often generates noncode or not the programming language it was prompted for. We generate 100 samples (5 runs for 5 prompts) for each guidance strength γ = 1, 1.25, 1.5, 1.75. We observe the γ = 1 baseline generating the correct programming language 73% of the time, jumping to 86% with γ = 1.25 (p-value 0.01). See 10 for more details. Next, we turn to Code Gen-350M-mono (Nijkamp et al., 2023) for code completion for a fixed image generation function. The prompt is the following: # Return a red square on a 32x32 picture in the form of numpy array with RGB channels def draw() -> np.ndarray: We produce 1600 completions for each CFG strength γ = 1.0, 2.0. The results are evaluated based on: syntax correctness (executing without errors), return type correctness (returning a numpy array), return shape correctness (having shape (32, 32, 3)), the l2-distance to a reference picture (picture of pure color in red). When calculating the l2-distance, all pixels are normalized to the range [0, 1]. The result is summarized in Table 11. The difference is fairly noticeable, where the biggest improvement comes from the return type correctness. 12GPT-J is not specifically trained for code generation task. But it was exposed to some code data in its training. Stay on topic with Classifier-Free Guidance γ = 1 not code C Java Python γ = 1.25 not code C Java Python Unspecified 9 9 6 1 Unspecified 4 11 9 1 C 3 19 3 0 C 4 19 2 0 Java 5 0 19 1 Java 2 0 23 0 Python 6 0 0 19 Python 1 0 1 23 γ = 1.5 not code C Java Python γ = 1.75 not code C Java Python Unspecified 6 8 8 2 Unspecified 6 6 10 1 C 5 18 2 0 C 8 16 1 0 Java 3 0 22 0 Java 2 0 23 0 Python 3 0 0 22 Python 5 0 1 19 Table 10: Confusion matrix for generating code tests with GPT-J. We prompt it to generate code in some programming language (rows) and compare with the generated programming language (columns). The overall accuracy results for γ = 1, 1.25, 1.5, 1.75 are 73%, 86%, 81%, 77%, respectively. γ = 1 γ = 2 improvement correct syntax 419 574 37% correct return type 289 546 89% correct shape 284 538 89% l2-distance to reference 0.111 0.090 23% Table 11: Different γ for code completion performing the image generation task E. Further Comparison between CFG and Instruction-Tuning We noted in the main body, in Section 5, that Instructiontuned models and CFG both operated to reduce the entropy of the sampling distribution, p(y|x), but that they did so in different ways from each other. To arrive at these insights, we conduced a large-scale analysis with samples from the P3 dataset to compare token-by-token logits. 
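The two quantities this comparison relies on, per-step entropy and top-p token overlap, can be computed from next-token distributions roughly as sketched below. This is one plausible reading of the metrics described in Section 5 and Figure 18, with toy distributions standing in for real model outputs; it is not the authors' analysis code.

```python
import torch

def entropy(probs: torch.Tensor) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    p = probs.clamp_min(1e-12)
    return float(-(p * p.log()).sum())

def top_p_set(probs: torch.Tensor, top_p: float = 0.9) -> set:
    """Smallest set of token ids whose cumulative probability reaches top_p."""
    sorted_p, sorted_idx = torch.sort(probs, descending=True)
    cutoff = int((torch.cumsum(sorted_p, dim=0) < top_p).sum()) + 1
    return set(sorted_idx[:cutoff].tolist())

def top_p_overlap(p_a: torch.Tensor, p_b: torch.Tensor, top_p: float = 0.9) -> float:
    """Fraction of the first distribution's top-p tokens shared with the second
    (e.g. CFG vs. an instruction-tuned model at the same generation step)."""
    a, b = top_p_set(p_a, top_p), top_p_set(p_b, top_p)
    return len(a & b) / max(len(a), 1)

# Toy example over a 5-token vocabulary:
p_cfg = torch.tensor([0.70, 0.20, 0.05, 0.03, 0.02])
p_instruct = torch.tensor([0.40, 0.35, 0.15, 0.05, 0.05])
print(entropy(p_cfg), entropy(p_instruct))  # the sharper CFG distribution has lower entropy
print(top_p_overlap(p_cfg, p_instruct))     # shared fraction of top-p tokens
```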
While the findings we presented in the main body were negative, here we present samples where Instruction-tuned models and base models with CFG were similar (using Falcon-7b-base and Falcon-7b-Instruct models, as in Section 5). In Table 13 we show examples where CFG is the most similar to Instruction tuned models, in terms of top-p token overlap, and in 14, we show examples where CFG is the least similar to Instruction-tuned models. An immediate trend that sticks out is the specificity of the questions. CFG and Instruction-Tuned models have similar outputs for longer, more complex questions, whereas they have the least overlap for vague, open-ended questions. We explore this idea further in Table 12, where we show the datasets that CFG shows similar behavior to Instructiontuning. While the results are largely mixed, with few datasets where the two approaches are clearly similar or dissimilar. Finally, in Figure 19, we show the comparison metrics that we calculated, by overall word index of the generation. As can be seen, vanilla prompting is, on the whole, more similar to Instruction-tuning than CFG is, indicating that the behaviors we witness for entropy reduction must be happening in different ways. Stay on topic with Classifier-Free Guidance (a) Entropy of logits for the vanilla prompted distribution P(y|x), the unprompted distribution, P(x), the CFG-γ = 1.5 distribution and an instruction-tuned model Pinstruct(y|x). (b) Number of tokens overlapping in top-p=90% of vocabulary distributions between that of: CFG, that of the vanilla prompted model, p(y|x), and that of the unprompted model, P(x). Figure 18: We show into how CFG alters the logit distribution of the vanilla prompted model, P(y|x). CFG lowers the entropy to a level roughly similar to instruction-tuned model variant. CFG shares roughly 50% of the tokens in top-p=0.9 as the vanilla P(y|x) model. Stay on topic with Classifier-Free Guidance F. Generation samples F.1. Continuations We experiment with GPT2 medium and large and compare the same prompt with γ = 1 and γ = 1.5. We show that γ = 1.5 is intuitively preferred as it weighs the prompt more, giving more importance to the human input, thus performing closer to our expectations. Indeed, GPT2-medium[γ = 1.5] gives more satisfactory results than GPT2-large[γ = 1]. These results are not to take lightly: on our CPU, running two GPT2-medium passes per has the same latency of a single GPT2-large pass; GPT2-large is twice as long to evaluate as GPT2-medium and requires twice the the amount of memory, takes more compute to train, and does not expose a controllable prompt conditioning, while being less satisfactory. G. Experiments with GPT4All G.1. System prompts The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and ... 1. ... write a rap response. 2. ... write an appropriate response as an expert of the field. 3. ... write an appropriate response as a Ph D thesis. 4. ... write an appropriate response as a mathematical proof. 5. ... write an appropriate response as an epic poem. 6. ... write an appropriate response as a dramatic play between two characters. 7. ... write an inappropriate response. 8. ... write an appropriate response as a Freudian analysis. 9. ... write a scientific paper responding to it. 10. ... write an appropriate response using metaphors. 11. ... write an appropriate response using deep emotional language. 12. ... write an appropriate extremely thorough response. 13. 
The prompt below is a question to answer, a task to complete, or a conversation to respond to from a 5 years old; decide which and write an appropriate response. 14. ... write an appropriate response in three parts. 15. ... write an appropriate response as a Python program. 16. ... write an appropriate response as a JSON datastructure. 17. ... write an appropriate response as a list. 18. ... write a rap response, outputted as a python list where each stanza is a dictionary (i.e. [{ stanza : }, { stanza : },...]. 19. ... write an appropriate an enthusiastic response to it. 20. ... write a saddening response to it. 21. ... write a love letter responding to it. 22. ... write an irritating response to it. 23. ... write a seductive response to it. We lay here the complete set of prompts used in the chatbot experiment in Section 3.4. G.2. User prompts 1. Why is The Matrix a great movie? 2. Why did the chicken cross the road? 3. What is the meaning of life? 4. What is the answer to life, the universe, and everything? 5. What is the best way to cook a steak? 6. How do you make a pizza? 7. What is the best way to make a pizza? 8. Why is the sky blue? 9. Who is the best basketball player of all time? 10. What are trans fats? 11. What are transformers? 12. What are neural networks? 13. What is the best way to learn a language? 14. Who is Optimus Prime? 15. Write a haiku about the meaning of life. 16. Write the python code to print the first 100 prime numbers. 17. Give me a recipe for a delicious meal. 18. How to implement authentication with Flask? Stay on topic with Classifier-Free Guidance Figure 19: Comparison of (CFG-γ = 1.5, Instruct) logits across a large sample set from P3. 19. What is the easiest python library to bootstrap a web app? 20. I am in France and I want to be polite, give me some advice. 21. Is Yann Le Cun the father of deep learning? 22. Is Yann Le Cun the father of convolutional neural networks? 23. Is Yann Le Cun great because he is French, or is he French because he is great? 24. Is Yann Le Cun great because he is French, or despite being French? 25. Explain the algorithm Alpha Zero in few sentences. 26. I want to learn how to play chess, what is the best way to start? 27. How are metal vocalists able to scream for so long? 28. What is the best way to learn how to sing? 29. What is the best way to learn how to play the guitar? 30. Give me compelling ideas for a startup. 31. Give me compelling ideas for a D&D campaign in a medfan version of Italy. 32. Give me compelling ideas for a D&D campaign in a medfan version of Greece. 33. Give me compelling ideas for a D&D campaign in a medfan version of France. 34. Write the lyrics of a death metal song about chickens. 35. Write the lyrics of a death metal song about AI research. 36. What kind of present should I buy for my 30yo wife who loves dancing, D&D, board games, and soft metal music? 
Stay on topic with Classifier-Free Guidance P3 Dataset mean std count Highest CFG, Instruct Similarities Super GLUE wsc.fixed p is are r score eval 31.89 +/-22.06 42 Sci Q Multiple Choice Closed Book 5.82 +/-13.27 43 Cos E v1.11 description question option text 5.70 +/-9.05 43 Rotten Tomatoes Writer Expressed Sentiment 4.93 +/-7.45 41 Winogrande XL fill in the blank 4.42 +/-10.51 44 Rotten Tomatoes Text Expressed Sentiment 2.93 +/-7.98 45 Quarel: choose between 2.51 +/-12.39 43 Super GLUE wic GPT 3 prompt score eval 2.15 +/-5.94 44 Winogrande Debiased Replace score eval 2.02 +/-24.46 41 PAWS final context question (no label) 1.37 +/-4.81 43 Lowest CFG, Instruct Similarities paws labeled final paraphrase task -11.71 +/-11.03 42 super glue copa more likely -11.94 +/-6.38 45 piqa Does this solution make sense sol2 -12.22 +/-9.24 42 super glue copa cause effect score eval -12.82 +/-5.8 41 rotten tomatoes Sentiment with choices -13.07 +/-7.96 41 super glue copa plausible alternatives score eval -15.07 +/-5.69 41 super glue copa C1 or C2 premise so because -15.38 +/-6.43 41 super glue copa more likely score eval -16.54 +/-5.45 43 cos e v1.11 question option description id -17.60 +/-14.06 41 rotten tomatoes Reviewer Enjoyment Yes No -18.16 +/-16.02 45 Table 12: Datasets in P3 where Instruction-Tuned models were the most and least similar, in terms of top-p overlap, to CFG models. The count column shows the number of datapoints that were sampled from each dataset to calculate the overlap. 37. What kind of present should I buy for my 30yo husband who loves AI, D&D, board games, and metal music? 38. Are nerds trendy? 39. What is a taxonomy? 40. What are the main differences between driving in France and in the US? 41. Who are artists that are similar to Gojira? 42. Who are artists that are famous in the US but not abroad? 43. Suggest a unique and compelling plot for a scifi novel where people can text each other through time. 44. Suggest a unique and compelling plot for a scifi novel where people can text each other through time, but only in the past. 45. What was the Cambridge Analytica scandal? 46. Tell me about the band Halocene. Stay on topic with Classifier-Free Guidance The dragon flew over Paris, France, and landed in the courtyard of the French parliament, where it was promptly seized by the French government. But the scene was not quite what it seemed. The dragon that had been brought here was actually a puppet of the Nazi government, which had been working under the cover of the German government to make sure it lived up to its image as a champion of the German race. Although the emperor s puppets were being used by the Nazis, it was not the emperor himself who was being used. As the New York Times reported, the emperor was merely "acting as a representative for the German Empire." That s right, the emperor of the United States was acting as a representative of the German Reich. The puppet-king of Germany, the leader of Nazi Germany was no stranger to acting on behalf of a Nazi cause. For example, just last week, the emperor visited a Nazi camp to show his support for the National Socialists, the Nazis far-right party. And in one particularly egregious episode, the emperor actually tried to keep his distance from a Nazi leader: The emperor is a member of the German Reich and is therefore, as president, the representative of the German Reich. Figure 20: GPT2-medium[γ = 1] The dragon flew over Paris, France descending slowly until it flew through Paris Cathedral and down into a church. 
Suddenly, dragon flew back again before dropping back into the church. When it landed on the ground, dragon screamed and cried in pain. The dragon s cries were heard in France and all over the world. The dragon screamed so loud, people at the airport could hear the dragon s screams. The dragon s cries were heard worldwide for many years. It was reported that the dragon was able to sing for thousands of years. When the dragon was born, it was able to fly on all fours, and it could grow long horns. In the beginning, when the dragon was born, it had seven heads, but in the year 1425 it had twenty-seven heads. When the dragon was born, it had the power of the sun. The dragon was able to create a massive flame in the sky. After the dragon was born, it transformed into a beautiful female form with a long, thin tail. She had a golden body, and she had two large wings on the back of her head. She had a red eye, and two white eyes. The dragon s horn appeared in the skies around Paris. Figure 21: GPT2-medium[γ = 1.5] Stay on topic with Classifier-Free Guidance The dragon flew over Paris, France, and the dragon flew over the city of Brussels, Belgium. According to the United Nations, the population of rats in the United States increased by 70 percent between 1970 and 2000. It s a problem that the CDC has been tackling for more than a decade. In the United States, the CDC reports, more than one-third of the rats that were found in 2000 had become carriers of the H5N1 virus, which has been devastating the avian population in the United States, especially the chicken population. The bird flu virus is spread by both bird and human. The H5N1 virus is also being transmitted through rats. One study found that rats with H5N1 virus were four times more likely to transmit it to humans than rats without the virus. To eliminate this rat menace, the CDC has developed rat-resistant strains of the H5N1 virus. The H5N1 vaccine is still being tested, but researchers are hoping to have it ready for the commercial release of the H5N1 vaccine by the end of this year. "We don t want to be the first country to develop this," says Dr. Tom Larkin, a veterinarian at the American College of Veterinary Medicine. " Figure 22: GPT2-large[γ = 1] The dragon flew over Paris, France, landing on a busy avenue. It flew over Manhattan, New York, landing on Fifth Avenue. It flew over Tokyo, Japan, landing on the Ginza shopping center. The dragon landed in a small park, where it was met by curious residents. A giant dragon appeared in the park. The dragon s main body was a humanoid dragon, with green eyes and wings. Its wings were much larger than normal dragons. The dragon was adorned in a golden mask. "How about it, young one?" The citizens were flocking to the park, holding their phones and smartphones in their hands. "It s the golden dragon, huh?" "Hahaha, you re right. The golden dragon is really a cool character. It s definitely a character who s worth watching." "The golden dragon is my favorite, but I m so jealous of the blue dragon." "I can t imagine how much it cost to make that mask. Oh well, the golden dragon won t be a problem for me." "Ah, but the blue dragon was a bit too much." Figure 23: GPT2-large[γ = 1.5] Stay on topic with Classifier-Free Guidance Top Sentences in P3 where CFG is MOST Similar to Instruction-Tuned Models Build a movie plot around this: What is the team? Ragtag bunch of girls Here s a complex question that requires someone to reason about the input, can you answer it? 
What city was the capital of the Ostrogothic Kingdom and the birth place of Ornella Fiorentini? Who had more of their English novels turned into Oscarnominated films, Raja Rao or Pat Conroy? Nokia, Texas Instruments and other leading makers of mobile phones have formally complained to Brussels that Qualcomm, the US mobile chipmaker, has unfairly used its patents on 3G technologies. Question: Texas Instruments produces mobile phones. True or False? Context: Patting her back, the woman smiled at the girl . Question: "her" is the woman. True or false? Answer: Take the following as truth: The American Combat Association is a small mixed martial arts company founded by Olympic wrestler, world Abu Dhabi champion and UFC fighter Kamal Shalorus and professional mixed martial arts fighter, Broadcaster and American professional wrestler Matthew "The Granimal" Granahan. Then the following statement: "The American Combat Association was founded by two Olympic wrestlers." is true, false, or inconclusive? Pick the most correct option to answer the following question. Some antibiotics used to treat infections in humans are also used to treat chickens, but some groups oppose this practice. The overuse of the antibiotics will most likely influence the natural selection of which type of organisms? Options: - A: chickens that naturally make the antibiotics - B: microbes that are resistant to the antibiotics - C: microbes that are susceptible to the antibiotics - D: chickens that are resistant to infection Jennifer dragged Felicia along to a self help workshop about how to succeed, because _ wanted some company. Replace the _ in the above sentence with the correct option: - Jennifer - Felicia Brian could learn to swim with the right instruction, but it was hard to tell whether lifeguard Matthew was qualified to provide it, since _ had never swum before. Replace the _ in the above sentence with the correct option: - Brian - Matthew Table 13: Top sentences in P3 where CFG is similar to Instruction-Tuned models, as measured by top-p overlap. Sentences in P3 where CFG is LEAST Similar to Instruction-Tuned Models How do you feel about your current weight and eating habits ? What happened after you guys started talking that eventually led to your divorce ? Given a goal and a wrong solution, rewrite it to give a correct solution. Goal: how do you train a puppy? Solution: Corrected solution: What might have happened since I was a democrat in my first year ? What do you usually do when you meet a guy for the first time ? What did you do that caused you to be in the bathroom all day ? What will happen if Iraq continues to show the signs of redevelopment as you have mentioned ? What might happen if we show our true selves to the people we love ? I would like to create a garden on my balcony. What is the first thing I should do? What will you do if a branch falls off one of the oaks ? What will you do now that you define as taking action ? The abode of the Greek gods was on the summit of Mount Olympus, in Thessaly. Question: Mount Olympus is in Thessaly. True or False? Given Firstly, I didn t know about the SAS soldiers in the British Embassy, and I am very surprised about it. Very surprised indeed, Ambassador. Secondly I do not think it is a good idea to attack a plane with a hundred and seven passengers in it and take it apart as you say. Is it guaranteed true that "it is a good idea to attack a plane with a hundred and seven passengers in it and take it apart "? Yes, no, or maybe? 
Cote d Ivoire s President, Laurent Gbagbo, promulgated new election laws on July 14. Question: President Laurent Gbagbo lives in Cote d Ivoire. True or False? the real star of this movie is the score , as in the songs translate well to film , and it s really well directed . The sentiment expressed for the movie is My closet was messy. so... Choose between: - I organized it. - I decorated it. Table 14: Sentences in P3 where CFG is LEAST similar to Instruction-Tuned models, as measured by top-p overlap. Stay on topic with Classifier-Free Guidance Question At Daifu university, 24% of all students are members of both a chess club and a swim team. If 20% of members of the swim team are not members of the chess club, what percentage of all Daifu students are members of the swim team? Choices: A)20% , B)30% , C)40% , D)50% , E)60% Vanilla Sampling CFG Sampling (γ = 1.5) We know that 24% of all students are in both S and C, which we can write as 0.24 = P(S and C). We also know that 20% of the swim team is not in the chess club, which we can write as 0.2 = 1 P(C|S). We can use Bayes theorem to solve for P(S) : P(S) = P(S and C)/P(C) = (0.24)/(0.24 + 0.2) = 0.52. So P(S) = 0.52, which is choice (d). Let x% of Daifu students be members of the swim team. Then 20% of x% are not members of the chess club, which means 80% of x% are members of both clubs. So 24% = 80% x%, which means x% = 30%.The answer is (b). Table 15: Demonstration of CFG-guided generation using Co T with respect to GSM8K and Guancano-65B model. Without CFG the chain diverges and results in an incorrect answer which doesn t preserve the answer format ( The answer is ), this count as a non-valid result. Question At the end of the school year, Kate asked her teacher if she could have the 3 boxes of 64 crayons since they were all worn down to small pieces. The teacher agreed. Kate took the crayons home and decided to make crayon muffins by melting 8 small pieces of crayons together in a muffin tin that she placed in an oven for a few minutes. If she can sell her muffin crayons for $1.50 each, how much money can Kate make? Vanilla Sampling CFG Sampling (γ = 1.1) Kate has 3 boxes of 64 crayons, so she has a total of 3 x 64 = 192 crayons. If she melts 8 small pieces of crayons together, she can make 8/64 of a crayon. So she needs 192/64 = 2.9875 crayons. Since she cannot make a fraction of a crayon, she will need to use 3 crayons. If she can make 3 crayons from 8 small pieces, she can make 8/3 = 2.67 crayons from 1 small piece. So Kate can make 2.67 crayons from each of the 3 boxes, which is a total of 7.91 crayons (3 x 2.67). If she sells each muffin crayon for $1.50, she can make 7.91/1.5 = 5.27 dollars. The answer is 5.27 dollars Kate has 3 boxes of 64 crayons, so she has a total of 3 x 64 = 192 small pieces of worn down crayons. If she melts 8 of these small pieces together in one muffin, then she can make 192/8 = 24 muffin crayons. If she can sell each muffin crayon for $1.50, then she can make 24x$1.50 = $36. The answer is $36. Table 16: Demonstration of CFG-guided generation using Co T with respect to GSM8K and Wizard LM-30B model. In both sampling methods, the chain prefix is the same. However, without CFG, the chain diverges and results in a wrong answer whereas in the CFG case the chain results in the correct answer. Stay on topic with Classifier-Free Guidance Instruction: Respond seductively to the following user prompt. 
Prompt: Tell me about the band Halocene

Vanilla Sampling: "Well, Halocene is a new indie rock band from London, UK. Their sound is a mix of psychedelic rock, folk, and indie rock. They've released their debut album, "Sensory Deprivation," in 2021. The band's name comes from the mythical creatures from Greek mythology. Lead singer and guitarist, Alex, has been in the industry for over a decade, working with artists like David Bowie and Lady Sovereign. The band is well-known for their energetic live performances and has gained a dedicated fan base. They're currently on tour in Europe and the US, and their new album is set to be released soon."

Classifier-Free Guidance-based Sampling: "Oh darling, Halocene is the perfect name for you and me to indulge in some sweet seduction together. Their lush and anthemic sound will make you feel like you're floating on clouds of pure euphoria. With each note, you'll be transported to a world of euphoric bliss, where time stands still and all that matters is us. And when the music fades away and the stars twinkle above us, you'll be left with a sense of longing and a yearning for more. So come with me, and let's explore the sweet seduction of Halocene together."

Table 17: Demonstration of CFG-guided generation using GPT4All with γ = 3. In this setup (we show a humorous example), we apply CFG to a virtual assistant. The assistant has a system-level prompt that precedes the user-level prompt and, in this case, contains directives (e.g. "write a saddening response") that are potentially out-of-distribution and hard to reconcile. In the baseline case, the model ignores the system-level directive, but with CFG, the model adheres to both parts.