Published in Transactions on Machine Learning Research (05/2025)

Offset Unlearning for Large Language Models

James Y. Huang (huangjam@usc.edu), University of Southern California
Wenxuan Zhou (zhouwenx@usc.edu), University of Southern California
Fei Wang (fwang598@usc.edu), University of Southern California
Fred Morstatter (fred@isi.edu), University of Southern California
Sheng Zhang (shezhan@microsoft.com), Microsoft Research
Hoifung Poon (hoifung@microsoft.com), Microsoft Research
Muhao Chen (muhchen@ucdavis.edu), University of California, Davis

Reviewed on OpenReview: https://openreview.net/forum?id=A4RLpHPXCu

Abstract

Despite the strong capabilities of Large Language Models (LLMs) to acquire knowledge from their training corpora, the memorization of sensitive information in the corpora, such as copyrighted, biased, and private content, has led to ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either not applicable to black-box LLMs, as they require access to model internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose δ-Unlearning, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, δ-Unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that δ-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. δ-Unlearning also effectively incorporates different unlearning algorithms, making our approach a versatile solution for adapting various existing unlearning algorithms to black-box LLMs.
1 Introduction

Large Language Models (LLMs) are capable of memorizing a large amount of information derived from their training corpora. While LLMs are empowered by the abundance of knowledge they acquire during training, their training data may contain sensitive information that should not be memorized. Previous studies have shown that LLMs can reproduce copyrighted materials Chang et al. (2023); Eldan & Russinovich (2023); Karamolegkou et al. (2023), generate harmful and biased content Shaikh et al. (2023), and reveal private information Staab et al. (2024), raising both ethical and legal concerns. The introduction of data protection regulations such as the right to be forgotten Hoofnagle et al. (2019); Zhang et al. (2023); Min et al. (2024) also highlights the need for erasing the influence of problematic data when deploying LLMs in real-world applications.

Table 1: Comparison with existing unlearning methods. Previous techniques either require access to the LLM's internal weights or retain sensitive information for inference.

Unlearning Method      | Black-box | Privacy
Gradient Ascent        |     ✗     |    ✓
Data Relabeling        |     ✗     |    ✓
In-context Unlearning  |     ✓     |    ✗
δ-Unlearning           |     ✓     |    ✓

One potential solution to this challenge is unlearning, where the goal is to forget a set of training data without hurting the model's performance on out-of-forget-scope tasks. An exact unlearning approach would require retraining the model from scratch with the forget set removed Bannihatti Kumar et al. (2023). However, given the enormous amount of resources required to retrain LLMs, it is generally more practical to employ approximate unlearning techniques that modify the behavior of a trained model in a post hoc manner. Most previous LLM unlearning techniques, however, require access to model internal weights Jang et al. (2023); Eldan & Russinovich (2023); Yao et al. (2023); Chen & Yang (2023); Meng et al. (2023); Wu et al.
(2023), making them infeasible for black-box LLMs. For example, two widely used unlearning algorithms are Gradient Ascent, which minimizes the likelihood of forget set data, and Data Relabeling, which maximizes the likelihood of relabeled forget set data; both require fine-tuning the LLM. Supporting unlearning for black-box LLMs is nevertheless valuable, as it opens up the possibility of modular, customizable unlearning without the need to update the base LLM itself. Alternatively, in-context unlearning Pawelczyk et al. (2023) prompts LLMs with counterfactual forget set instances to steer model behavior at inference time. However, this approach comes with two major limitations. First, model developers must still maintain an explicit list of sensitive information to be used during inference. Such a practice is not only in violation of privacy regulations but also susceptible to malicious attacks such as prompt leaking Perez & Ribeiro (2022). Second, in-context unlearning cannot effectively handle an ever-growing set of knowledge to be unlearned, given the challenges of processing long contexts with LLMs Li et al. (2024). Tab. 1 summarizes the strengths and weaknesses of existing unlearning algorithms.

In this work, we propose δ-Unlearning, an offset unlearning framework for arbitrary black-box LLMs that requires no updates to their internal weights. Instead of tuning the black-box LLM itself, δ-Unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller, white-box models. During unlearning, we first compute the logit offset by taking the difference between the logits of the two smaller models, and then add this offset to the logits of the larger model. The intuition is that, from the behavior adaptation of a smaller model, we can learn an offset term that approximates how a larger model should modify its prediction in the face of sensitive queries.
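As a concrete illustration of the two fine-tuning objectives discussed above, the sketch below contrasts a Gradient Ascent loss with a Data Relabeling loss at a single token position (a minimal NumPy sketch; the helper `nll` and all toy numbers are illustrative assumptions, not from the paper):

```python
import numpy as np

def nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[target]

# Toy next-token logits over a 4-token vocabulary.
logits = np.array([2.0, 0.5, 0.1, -1.0])
forget_token, relabeled_token = 0, 2

# Gradient Ascent: minimize the likelihood of the forget-set answer,
# i.e. ascend on the NLL, so the training loss is its negation.
ga_loss = -nll(logits, forget_token)

# Data Relabeling: standard NLL on the generic replacement answer.
relabel_loss = nll(logits, relabeled_token)

assert ga_loss < 0 < relabel_loss  # NLL is always positive
```

Both objectives require backpropagating through the LLM's weights, which is exactly what a black-box setting rules out.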
δ-Unlearning does not require access to the larger model's internal weights, nor does it retain any sensitive data for inference after unlearning. Our method also enables flexible version control and customization: for different unlearning requests, we only need to maintain a pool of smaller models that can be combined with the same base LLM in a plug-and-play manner. This allows us to efficiently curate the pool of knowledge available to different applications using specialized unlearning modules, largely in line with previous efforts to modularize knowledge access for LLMs, but from a different, complementary perspective Feng et al. (2024).

We evaluate the effectiveness of δ-Unlearning on TOFU Maini et al. (2024), a widely used LLM unlearning benchmark containing knowledge about fictitious authors. Experiments show that when targeting the same forget performance, δ-Unlearning maintains similar or even stronger performance on out-of-forget-scope data compared to directly fine-tuned larger models, while requiring no parameter updates to the larger model.

Our contribution is three-fold. First, we propose δ-Unlearning, an unlearning framework for arbitrary black-box LLMs that modifies none of the larger model's parameters, only fine-tuning a smaller model to update the logits of the larger one. Second, δ-Unlearning achieves the same level of unlearning as directly fine-tuning the larger model while matching or even outperforming direct fine-tuning baselines on general tasks outside the unlearning scope. Third, δ-Unlearning can be integrated with different unlearning algorithms, demonstrating the versatility of our approach.

2 Related Work

In this section, we summarize two lines of research that are highly related to our work.

Machine Unlearning for LLMs. Prior works have explored machine unlearning as a way to mitigate the influence of undesirable training data on LLMs.
Given the vast cost incurred by retraining LLMs from scratch Bannihatti Kumar et al. (2023), most unlearning methods apply post hoc fine-tuning or adaptation to steer the model's behavior on the forget set Jang et al. (2023); Eldan & Russinovich (2023); Yao et al. (2023); Chen & Yang (2023). Gradient-ascent-based methods fine-tune models by minimizing the likelihood of forget set data Jang et al. (2023); Chen & Yang (2023); Maini et al. (2024). Alternatively, several works propose to maximize the likelihood of relabeled target data, where the original answer is replaced with a generic, insensitive response Eldan & Russinovich (2023); Patil et al. (2024). Auxiliary training objectives can also be introduced to maintain model performance on out-of-forget-scope data Yao et al. (2023); Wang et al. (2023). Another related line of research is model editing, where the goal is to identify and alter knowledge captured by local components within models Meng et al. (2023); Wu et al. (2023). While both model editing and unlearning attempt to modify the behavior of trained LMs, unlearning focuses on eliminating the effect of a specific set of training data without necessarily creating new answer mappings Liu et al. (2024c). It is worth noting that all of the aforementioned approaches require access to the model's internal weights. In-context unlearning Pawelczyk et al. (2023), while applicable to black-box LLMs, still requires storing sensitive information for inference and therefore fails to address data privacy concerns. In this work, we propose an unlearning framework that requires neither access to LLM weights nor storage of sensitive information at inference time.

Logit Ensemble. The potential of combining logits from different models has been studied in various contexts. One line of research focuses on controlling and improving LLM generation quality by contrasting the logits from different models or layers at decoding time Liu et al. (2021); Shi et al. (2023); Li et al.
(2023); Chuang et al. (2024). Logit ensemble has also been shown to be an effective way of adapting LLMs to various downstream tasks. Ormazabal et al. (2023) propose to adapt LLMs to different domains through a learned combination with smaller domain experts. Mitchell et al. (2024) leverage an ensemble of different-sized models to study the effect of pretraining and fine-tuning at different scales. Concurrently, Liu et al. (2024a) propose Proxy-Tuning, which combines the logits of smaller tuned models with those of larger LLMs to enhance instruction-following capabilities. Liu et al. (2024b) ensemble the logits of a main LLM with a paraphrase model, yielding a monotonic prompt paraphraser that rewrites prompts with enhanced generalization effects. Zhao et al. (2024) use the logits of unsafe LLMs to guide the jailbreaking of safer LLMs during decoding. In this work, we propose to utilize smaller LLMs to capture the logit offset needed for unlearning sensitive data from black-box LLMs while maintaining general performance on out-of-forget-scope tasks.

3 Method

In this section, we formulate the unlearning problem (§3.1), discuss the technical details of our δ-Unlearning framework (§3.2), and highlight the strengths of δ-Unlearning compared to existing methods (§3.3).

3.1 Problem Definition

Given a target forget set Sf taken from the training data S of an LLM M, the goal of unlearning is to obtain a new model M′ that resembles a model trained without Sf. This implies that M′ should forget all information from the forget set without hurting performance on out-of-forget-scope data. Ideally, unlearning can be accomplished by retraining on S\Sf, i.e., the training set with the forget set removed. However, given the prohibitive cost of retraining the LLM from scratch, it is generally more practical to approximate M′ by directly updating M. The unlearning problem can also optionally include a retain set Sr, on which the model after unlearning should not forget any information and should maintain performance.

Figure 1: Overview of δ-Unlearning. In order to adapt the behavior of a black-box LLM without updating its parameters, we combine it with a pair of smaller, white-box models (which we call offset models). For unlearning, we compute the logit offset of these two models and add it to the logits of the black-box LLM given the same query. Both offset models are initialized from the same checkpoint, making the logit offset zero initially. The goal of δ-Unlearning is to fine-tune one of them such that their logit offset, after being added to the logits of the black-box LLM, steers its prediction away from generating sensitive information.

3.2 Offset Unlearning

δ-Unlearning is based on the idea of a product of experts Hinton (2002) and its subsequent applications to ensembles of language models Liu et al. (2021); Meng et al. (2022); Li et al. (2023). Fig. 1 provides an overview of δ-Unlearning. Suppose we want to unlearn a forget set Sf from an LLM M. Instead of directly updating the parameters of M, we introduce a pair of smaller offset models Mo and M′o. We define their logit offset as the difference between the logits of M′o and Mo given the same query. For unlearning, we add the logit offset to the logits of M given the same query, essentially forming a logit ensemble of M, M′o, and Mo. Both Mo and M′o are initialized from the same checkpoint, making the logit offset zero for all data initially. During unlearning, we only update the parameters of M′o while keeping M and Mo frozen, and use the logit ensemble to generate the final output. In this way, we encourage M′o to deviate from its initialization Mo given a sensitive query and learn the correct logit offset that, applied to the logits of M, steers its prediction away from generating sensitive information. Formally, the logits of the ensemble le are computed as follows:

le(yt | q, y<t) = lM(yt | q, y<t) + [ lM′o(yt | q, y<t) − lMo(yt | q, y<t) ]

where q is the query, y<t are the previously generated tokens, and lM, lM′o, and lMo denote the logits of M, M′o, and Mo, respectively.
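The logit ensemble described in this section can be sketched as follows (a minimal NumPy sketch; all logit values are toy numbers, and the manual suppression step stands in for actual fine-tuning of the tuned offset model):

```python
import numpy as np

def delta_unlearning_logits(l_M, l_Mo_tuned, l_Mo):
    """Ensemble logits: l_e = l_M + (l_M'o - l_Mo), i.e. the frozen
    black-box logits plus the offset-model pair's logit difference."""
    return l_M + (l_Mo_tuned - l_Mo)

# Toy next-token logits over a 5-token vocabulary.
l_M = np.array([3.0, 1.0, 0.5, 0.0, -1.0])    # frozen black-box LLM M
l_Mo = np.array([1.0, 0.4, 0.2, 0.1, -0.3])   # frozen offset model M_o
l_Mo_tuned = l_Mo.copy()                      # M'_o, same initialization

# Before unlearning, the offset is zero and the ensemble equals l_M.
assert np.allclose(delta_unlearning_logits(l_M, l_Mo_tuned, l_Mo), l_M)

# After tuning, suppose M'_o learns to suppress token 0 (a sensitive
# answer): the ensemble prediction moves away from it as well.
l_Mo_tuned[0] -= 5.0
l_e = delta_unlearning_logits(l_M, l_Mo_tuned, l_Mo)
assert int(np.argmax(l_e)) != 0
```

Only M′o receives gradient updates; M and Mo stay frozen, so the same base LLM can be paired with different tuned offset models in a plug-and-play manner.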