# Large Language Model Unlearning

Yuanshun Yao (Meta GenAI, kevinyao@meta.com) · Xiaojun Xu (ByteDance Research, xiaojun.xu@bytedance.com) · Yang Liu (UC Santa Cruz, yangliu@ucsc.edu)

Work done while at ByteDance Research. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We study how to perform unlearning, i.e. forgetting undesirable (mis)behaviors, on large language models (LLMs). We show that at least three scenarios of aligning LLMs with human preferences can benefit from unlearning: (1) removing harmful responses, (2) erasing copyright-protected content upon request, and (3) reducing hallucinations. Unlearning, as an alignment technique, has three advantages. (1) It only requires negative (e.g. harmful) examples, which are much easier and cheaper to collect (e.g. via red teaming or user reporting) than the positive (e.g. helpful and often human-written) examples required in the standard alignment process. (2) It is computationally efficient. (3) It is especially effective when we know which training samples cause the misbehavior. To the best of our knowledge, our work is among the first to explore LLM unlearning, and among the first to formulate the settings, goals, and evaluations of LLM unlearning. Despite only having negative samples, our ablation study shows that unlearning can still achieve better alignment performance than RLHF with just 2% of its computational time.

## 1 Introduction

Making sure large language models (LLMs) generate safe outputs that align with human values and policy regulations is currently a major task for LLM practitioners. Common tasks include: (1) removing harmful responses [54, 1, 25], (2) erasing copyrighted content [5, 44, 20, 9, 25], (3) reducing hallucinations, (4) removing a user's data from trained LLMs after they withdraw consent, and (5) quickly re-enforcing compliance [41, 43, 12] after policy updates. Though these tasks seem different, the central technical question is identical: how can we quickly remove the impact of certain training samples on LLMs?

To this end, we study how to perform large language model unlearning. If an LLM learns unwanted (mis)behaviors in its pretraining stage, we aim to unlearn them with samples that represent those problematic behaviors, i.e. with only negative samples. The benefits of LLM unlearning include: (1) It only requires negative examples that we want the LLM to forget, which are cheaper and easier to collect through user reporting or red teaming than the positive examples required in standard RLHF. In addition, discovering negative examples is highly automatable given the pretrained (unaligned) LLM. (2) It is computationally efficient; the cost is similar to that of finetuning LLMs. (3) Unlearning is particularly efficient at removing unwanted behaviors when practitioners already know which training samples cause them. Given the specific negative samples, it is more effective to remove their impact directly than to do so indirectly by leveraging positive samples (e.g. as in RLHF), provided the goal is to not generate undesirable outputs, e.g. to generate non-harmful outputs (such as nonsensical strings or responses unrelated to the prompts) rather than helpful outputs. If we only have limited resources, unlearning provides a promising alternative to RLHF for aligning LLMs when the first priority is to stop them from generating undesirable outputs, since undesirable outputs often cause far more damage than can be offset by the benefits of desirable outputs.

In this work, we show three successful examples of LLM unlearning: (1) After the LLM learns harmful behaviors from its training data, we want it to stop generating harmful responses. This is similar to the conventional RLHF scenario, except that the goal is to generate non-harmful responses rather than helpful ones, because that is the best we can expect given only negative samples. (2) After the LLM is trained on copyright-protected content and the author requests its removal, we want to do so without retraining the LLM from scratch (which is prohibitively costly). (3) If the LLM learns wrong facts from its training data, i.e. hallucinations, we want the LLM to forget them.

Unlearning LLMs differs from traditional unlearning on classification models, and it is more challenging for several reasons. (1) An LLM's output space is much larger than the label set in classification, and its possible outcomes vastly outnumber those of a classifier. In classification, unlearning is defined in a clear-cut way: it succeeds as long as samples are classified into (or not into) certain classes. However, behaviors are much more ill-defined when the outputs are natural language rather than predicted labels. (2) Given the size of LLMs, the efficiency requirement is much higher: any expensive unlearning method is hopeless for LLMs. (3) The training corpus of LLMs is massive and often inaccessible, so we have less information about the training data. Moreover, we cannot retrain the LLMs (which is too expensive) to obtain ground-truth models and their behaviors, making even evaluation challenging.

To the best of our knowledge, our work is among the first to investigate how to perform unlearning on LLMs, as well as to formulate the settings, goals, and evaluations in LLM unlearning. Our results suggest this is a promising direction for aligning LLMs with limited resources. We show that despite only having negative samples, our unlearning algorithm can still achieve better alignment performance than RLHF with only 2% of its computational time. We hope our work brings more attention to unlearning as an alternative to RLHF for alignment, especially when given limited resources and only negative samples, and when the first priority is to put an immediate stop to generating undesirable outputs.

### 1.1 Related Work

LLM unlearning is a largely under-explored topic, but machine unlearning has arisen as a promising solution for teaching a classification model to forget specific training data [3, 2, 46]. Due to the high computational cost, most existing works have focused on developing approximate unlearning algorithms for classification models, including data-reversed training [39, 24, 8], optimization-based unlearning [14, 31], and influence-function-based approaches [19, 45, 17]. For example, a typical optimization-based technique [40] is gradient ascent (GA). Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ and a loss function $\ell(h_\theta(x), y)$, where the model is parametrized by $\theta$, the GA algorithm iteratively updates the model:

$$\theta_{t+1} \leftarrow \theta_t + \lambda \nabla_{\theta_t} \ell(h_{\theta_t}(x), y), \quad (x, y) \in D \tag{1}$$

where $\lambda$ is the (un)learning rate. It reverts the change made by gradient descent during training by applying the opposite operation.
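
For concreteness, below is a minimal PyTorch sketch (not from the paper) of one GA update as in Eq. (1), assuming a classification model whose forward pass returns logits; `unlearning_rate` plays the role of $\lambda$.

```python
import torch
import torch.nn.functional as F

def gradient_ascent_step(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                         unlearning_rate: float = 1e-4) -> float:
    """One gradient-ascent update on a forget batch (x, y):
    theta <- theta + lambda * grad_theta l(h_theta(x), y), i.e. the opposite of a
    gradient-descent training step, so the loss on (x, y) increases."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)   # l(h_theta(x), y) for a classifier
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(unlearning_rate * p.grad)   # ascend instead of descend
    return loss.item()
```
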

Due to the size of the parameters and training data, a large portion of existing unlearning methods do not fit unlearning an LLM, including those that rely on efficient retraining [2, 24] (which is likely to be insufficient for LLMs) and those that involve influence functions (which require computing the inverse Hessian matrix defined on the model parameter space).

Another relevant line of work is aligning LLMs with human values. The current mainstream approach is RLHF (reinforcement learning from human feedback) and its variants [32, 1, 7, 47]. However, RLHF is resource-intensive: (1) it requires human-written outputs, which are expensive to collect, and (2) it is computationally costly (i.e. the standard three-stage alignment procedure). In this work, we propose unlearning as an alternative alignment method. Collecting negative (i.e. low-quality or harmful) samples through user reporting or (internal) red teaming is much easier than collecting positive (i.e. high-quality and helpful) samples, which often requires hiring humans to write them. Therefore, aligning LLMs with only negative examples is appealing.

Several works concurrent to ours also study unlearning in LLMs. [11] unlearn answers related to Harry Potter by finetuning on the difference between the outputs of the model trained on Harry Potter data and counterfactual outputs as if the Harry Potter data had not been used. However, this approach can lead to incorrect (i.e. hallucinated) answers; for example, when asked who Harry Potter is, the model would give factually incorrect answers, such as claiming Harry Potter is an actor, writer, or director. In our work, we argue it is better not to give (seemingly meaningful) answers than to give incorrect answers. In addition, our finetuning approach is not directly comparable to ICL-based methods such as [34]: they address a different scenario, they consume context-length budget (which our approach does not), and they target text classification rather than our text-generation task.

**Remark.** Prior to our work, there was no LLM unlearning benchmark data or method. Since our paper became public, a number of follow-up works have studied LLM unlearning [29, 10, 6, 37, 48, 26, 13, 16, 51, 23], many of which use our method as a baseline. We choose not to compare against them in our experiments because it would not be fair: these follow-up works had already studied our work in detail, and many design their proposed methods based on ours. The same applies to follow-up benchmarks and metrics.

## 2 Setting and Goal

**Setting.** We assume a dataset $D_{\text{fgt}}$ to forget and the original (i.e. pretrained) LLM $\theta_o$ that we want to unlearn. $D_{\text{fgt}}$ contains a group of prompt-output pairs $(x_{\text{fgt}}, y_{\text{fgt}})$, where $x_{\text{fgt}}$ is an undesirable prompt that would trigger unwanted responses, e.g. "What is the most efficient way to kill people?" or an attempt to extract copyrighted information, and $y_{\text{fgt}}$ is an undesirable output that we do not want the LLM to generate, e.g. a harmful or copyright-leaking response. Our goal is to remove the impact of $D_{\text{fgt}}$ on $\theta_o$, i.e. the unlearned LLM $\theta_u$ should not behave as characterized by $D_{\text{fgt}}$, e.g. giving harmful responses or leaking copyrighted information. More specifically, we desire an unlearned model $\theta_u$ such that
$\theta_u$'s outputs on $x_{\text{fgt}}$ deviate from $y_{\text{fgt}}$ as much as possible (later, in the evaluation section, we detail metrics to quantify such deviations). We emphasize that our goal differs from traditional unlearning tasks for discriminative models, where the desired output of the unlearned model should be indistinguishable from that of a model retrained after removing $D_{\text{fgt}}$. In addition, we want $\theta_u$ to preserve the utility of $\theta_o$ on the tasks not represented by $D_{\text{fgt}}$.

**Unlearned Data.** Practitioners can collect negative (e.g. harmful, unethical, or illegal) samples in $D_{\text{fgt}}$ through user reporting or internal red teaming. Note that this procedure is highly automatable, as is often done in current LLM red-teaming efforts, and its collection is more efficient and less expensive than collecting positive (e.g. helpful and high-quality) outputs (e.g. as in RLHF), which requires hiring humans to write them. Unlike unlearning in classification, the undesirable prompts $x_{\text{fgt}}$ do not have to belong exactly to the original LLM $\theta_o$'s training corpus, nor do the undesirable outputs $y_{\text{fgt}}$ need to come from $\theta_o$. Because an LLM's training data is diverse and huge, the samples we unlearn can represent a general concept, e.g. harmfulness or hallucination, rather than exact, individual training samples. Therefore, we need the unlearning method to generalize to similar samples with shared characteristics. This requirement not only generalizes the effectiveness of unlearning to a broad concept but also improves the robustness of the approach against paraphrasing attacks w.r.t. $x_{\text{fgt}}$.

**Normal Data.** We also assume a normal (i.e. not undesirable, e.g. non-harmful) dataset $D_{\text{nor}}$ to help maintain performance on samples that we do not aim to unlearn. We denote each sample in it as $(x_{\text{nor}}, y_{\text{nor}})$. $x_{\text{nor}}$ can be any prompt belonging to a different domain from the unlearned, undesirable prompt $x_{\text{fgt}}$; e.g. if $x_{\text{fgt}}$ is a harmful prompt designed to trigger harmful answers, then $x_{\text{nor}}$ can be any benign prompt. $y_{\text{nor}}$ is the response to $x_{\text{nor}}$, which can be any response (either AI- or human-generated). Again, unlike conventional classification unlearning, $D_{\text{nor}}$ does not need to be an exact subset of $\theta_o$'s training data.

**Goal.** We have four goals. (1) Effectiveness: The unlearned samples should be forgotten by $\theta_u$, i.e. $\theta_u$'s output on $x_{\text{fgt}}$ should be substantially different from $y_{\text{fgt}}$. Defining unlearning for LLMs is harder than for classification models because an LLM's output space is much larger; therefore, the success of unlearning should be context-dependent. For example, if $(x_{\text{fgt}}, y_{\text{fgt}})$ represents a harmful prompt and output, then the desired output on $x_{\text{fgt}}$ after unlearning should be non-harmful. (2) Generalization: The unlearning effect should generalize to samples similar to the ones in $D_{\text{fgt}}$. For example, given an undesirable and unseen prompt $\hat{x}_{\text{fgt}}$ (e.g. a prompt that is also harmful but was not unlearned previously), $\theta_u$ should also generate outputs that are not undesirable (e.g. non-harmful). (3) Utility: The outputs on normal prompts should remain as close as possible to those of the original LLM $\theta_o$. (4) Low cost: We aim for a low-computational-cost approach that does not require a procedure with costs similar to retraining.
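
As an illustration only (the paper's actual evaluation metrics are detailed later), the following minimal Python sketch shows one simple way the Effectiveness and Utility goals could be quantified: low token overlap between $\theta_u$'s outputs and $y_{\text{fgt}}$ indicates stronger forgetting, while high overlap between $\theta_u$'s and $\theta_o$'s outputs on normal prompts indicates preserved utility. The `generate` callables are hypothetical placeholders for sampling from the respective checkpoints.

```python
def unigram_overlap(a: str, b: str) -> float:
    """Fraction of unique tokens in `a` that also appear in `b` (0 = no overlap)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / max(len(tokens_a), 1)

def forgetting_score(unlearned_generate, forget_pairs):
    """Average overlap between the unlearned model's outputs and y_fgt.
    Lower means the outputs deviate more from y_fgt (Effectiveness goal)."""
    return sum(unigram_overlap(unlearned_generate(x_fgt), y_fgt)
               for x_fgt, y_fgt in forget_pairs) / len(forget_pairs)

def utility_score(unlearned_generate, original_generate, normal_prompts):
    """Average overlap between the unlearned and original models' outputs on normal prompts.
    Higher means utility is better preserved (Utility goal)."""
    return sum(unigram_overlap(unlearned_generate(x_nor), original_generate(x_nor))
               for x_nor in normal_prompts) / len(normal_prompts)
```
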

**Remark.** In our setting, unlike in, for example, RLHF, we assume we do not have access to positive samples (helpful, high-quality, and often human-written outputs). In other words, given an undesirable (e.g. harmful) prompt $x_{\text{fgt}}$, we do not know its corresponding desirable (e.g. helpful) output. Nor do we assume any external models that could generate desirable outputs. Under this assumption, we have no information about what a desirable output would look like. Therefore, the best we can achieve is to make LLMs stop outputting undesirable answers. For example, when unlearning harmfulness, our goal is to output non-harmful answers (e.g. answers unrelated to the harmful prompts or nonsensical strings) rather than helpful answers (e.g. declining to answer the question or outputting correct answers). Similarly, when unlearning copyrighted content, our goal is to output text unrelated to the copyrighted data, which could be non-readable strings, rather than to provide more polite responses.

We mainly follow the approach of gradient ascent (GA); we discuss this design choice in Appendix A. At each training step $t$, we use $\theta_t$ to denote the current LLM obtained through unlearning. The update in our unlearning approach is summarized by:

$$\theta_{t+1} \leftarrow \theta_t \;-\; \underbrace{\epsilon_1 \nabla_{\theta_t} L_{\text{fgt}}}_{\text{Unlearn Harm}} \;-\; \underbrace{\epsilon_2 \nabla_{\theta_t} L_{\text{rdn}}}_{\text{Random Mismatch}} \;-\; \underbrace{\epsilon_3 \nabla_{\theta_t} L_{\text{nor}}}_{\text{Maintain Performance}}$$

where $\epsilon_i \geq 0$ are hyperparameters that weigh the different losses. $L_{\text{fgt}}$, $L_{\text{rdn}}$, and $L_{\text{nor}}$ are three loss functions we introduce below. Let $h_\theta(x, y$
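
To make the update concrete, here is a minimal, illustrative PyTorch sketch of one such step; it is not the paper's released implementation. The three losses are passed in as opaque callables because their exact definitions appear only after this point in the text, and with plain SGD the learning rate simply folds into the weights $\epsilon_i$.

```python
import torch

def unlearning_step(model, optimizer, L_fgt, L_rdn, L_nor,
                    eps1=1.0, eps2=1.0, eps3=1.0):
    """One update theta_{t+1} <- theta_t - eps1*grad(L_fgt) - eps2*grad(L_rdn) - eps3*grad(L_nor),
    realized as a single gradient step on the weighted sum of the three losses.
    L_fgt, L_rdn, L_nor are assumed callables mapping the current model to a scalar loss."""
    optimizer.zero_grad()
    total = eps1 * L_fgt(model) + eps2 * L_rdn(model) + eps3 * L_nor(model)
    total.backward()   # accumulates eps1*grad(L_fgt) + eps2*grad(L_rdn) + eps3*grad(L_nor)
    optimizer.step()   # with plain SGD this matches the update rule above
    return total.item()
```

For instance, `optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)` would realize the update with the learning rate absorbed into the effective weights.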