# ReFT: Representation Finetuning for Language Models

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts
Stanford University; Pr(Ai)2R Group
{wuzhengx,aryamana,peterwz,atticusg}@stanford.edu
{jurafsky,manning,cgpotts}@stanford.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Parameter-efficient finetuning (PEFT) methods seek to adapt large neural models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. We pursue this hypothesis by developing a family of Representation Finetuning (ReFT) methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT), and we identify an ablation of this method that trades some performance for increased efficiency. Both are drop-in replacements for existing PEFTs and learn interventions that are 15×–65× more parameter-efficient than LoRA. We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE. In all these evaluations, our ReFTs deliver the best balance of efficiency and performance, and almost always outperform state-of-the-art PEFTs. We release a generic ReFT training library publicly at https://github.com/stanfordnlp/pyreft.

1 Introduction

Pretrained language models (LMs) are frequently finetuned to adapt them to new domains or tasks [Dai and Le, 2015]. With finetuning, a single base model can be adapted to a variety of tasks given only small amounts of in-domain data. However, finetuning large LMs is expensive. Parameter-efficient finetuning (PEFT) methods propose to address the high costs of full finetuning by updating a small number of weights. This reduces memory usage and training time, and PEFTs achieve similar performance to full finetuning in many settings [Hu et al., 2023]. A hallmark of current state-of-the-art PEFTs is that they modify weights rather than representations. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative to weight updates.

In this paper, we pursue this hypothesis by developing and motivating Representation Finetuning (ReFT). Instead of adapting model weights, ReFT methods train interventions that manipulate a small fraction of model representations in order to steer model behaviours to solve downstream tasks at inference time. ReFT methods are drop-in replacements for weight-based PEFTs. This approach is inspired by recent work in LM interpretability that intervenes on representations to find faithful causal mechanisms [Geiger et al., 2023] and to steer model behaviours at inference time [Turner et al., 2023, Li et al., 2024], and it can be seen as a generalisation of the representation-editing work of Wu et al. [2024a], Turner et al. [2023], and Zou et al. [2023] (see appendix B for details).

We focus on a strong and highly efficient instance of the ReFT family that we call Low-rank Linear Subspace ReFT (LoReFT).
LoReFT is a parametrisation of ReFT that intervenes on hidden representations in the linear subspace spanned by a low-rank projection matrix, building directly on the distributed alignment search (DAS) method of Geiger et al. [2023] and Wu et al. [2023]. We also identify an ablation of this method (DiReFT) that trades some performance for increased efficiency. We evaluate our ReFTs on LLaMA-family models and small-scale LMs against existing PEFTs on standard benchmarks from four domains: commonsense reasoning, arithmetic reasoning, instruction-following, and natural language understanding. Compared to LoRA, we find that LoReFT uses 15×–65× fewer parameters while achieving state-of-the-art performance on commonsense reasoning, instruction-following, and natural language understanding against the strongest PEFTs. These findings indicate that ReFT methods are worthy of further exploration, as they may emerge as more efficient and effective alternatives to weight-based PEFTs.

Figure 1: Parameter count vs. performance for LoReFT and other PEFTs across four benchmarks when applied to LLaMA, Llama-2, Llama-3, and RoBERTa models. Despite training far fewer parameters than existing PEFTs, LoReFT achieves competitive or even state-of-the-art performance on all tasks. Its value is most apparent for the largest models in our evaluations. Note: FT is full-parameter finetuning, which is not a PEFT or ReFT method. Additional results are in section 4.

2 Related work

Parameter-efficient finetuning methods (PEFTs). PEFTs train a fraction of the model's parameters to adapt it to downstream tasks. We classify PEFTs into three categories:

1. Adapter-based methods train additional modules (e.g. fully-connected layers) on top of the frozen pretrained model. Series adapters insert components between LM attention or MLP layers [Houlsby et al., 2019, Pfeiffer et al., 2020, Wang et al., 2022, He et al., 2022b, Fu et al., 2021], while parallel adapters add modules alongside existing components [He et al., 2022a]. Since adapters add new components that cannot be easily folded into existing model weights, they impose an additional burden at inference time.[1]

2. LoRA [Hu et al., 2022] and DoRA [Liu et al., 2024c] use low-rank matrices to approximate additive weight updates during training, and require no additional overhead during inference since the weight updates can be merged into the model (see the sketch after this list). These are currently the strongest PEFTs.[2]

3. Prompt-based methods add randomly-initialised soft tokens to the input (usually as a prefix) and train their embeddings while keeping the LM weights frozen [Li and Liang, 2021]. These methods are often far from optimal compared to other PEFTs, and come at the cost of significant inference overhead. A variant of this method in which hidden-layer activations are also tuned was introduced as a baseline in Hu et al. [2022], with better performance.

[1] Several very recent papers introduce new adapter architectures but do not benchmark them on the tasks we consider, or they perform hyperparameter tuning in a different setup than done in this work. These include: LLaMA-Adapter [Zhang et al., 2024b], LLaMA-Adapter v2 [Gao et al., 2023], and Aligner [Ziheng et al., 2023].
[2] Additional methods not studied in this work: AutoLoRA [Zhang et al., 2024c], ResLoRA [Shi et al., 2024], and SiRA [Zhu et al., 2023].
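To make the merge property in item 2 concrete, the following is a minimal sketch of a LoRA-style low-rank update and how it can be folded into the frozen weight at inference time. It is an illustration under our own assumptions (a single linear layer, an arbitrary rank, no scaling factor), not the implementation of any particular library.

```python
# Minimal sketch of a LoRA-style low-rank weight update and its merge at
# inference time. Shapes, rank, and names are illustrative assumptions.
import torch
import torch.nn as nn

d_out, d_in, r = 768, 768, 8
W = nn.Linear(d_in, d_out, bias=False)          # frozen pretrained weight
W.weight.requires_grad_(False)
A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # trainable low-rank factors
B = nn.Parameter(torch.zeros(d_out, r))         # zero-init so the update starts at 0

def forward_train(x):
    # During training: W x + B A x, i.e. an additive low-rank update to W.
    return W(x) + x @ A.T @ B.T

# At inference, fold the learned update into the weight: no extra compute.
with torch.no_grad():
    W.weight += B @ A

x = torch.randn(2, d_in)
print(W(x).shape)   # the merged layer behaves like a plain Linear
```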
Representation editing. Recent work on activation steering and representation engineering shows that adding fixed or task-specific steering vectors [Subramani et al., 2022, Turner et al., 2023, Zou et al., 2023, Liu et al., 2024b, Vogel, 2024, Li et al., 2024] or applying concept erasure [Ravfogel et al., 2022, Belrose et al., 2023, Avitan et al., 2024, Singh et al., 2024] to the residual stream can enable a degree of control over pretrained LM generations without the need for resource-intensive finetuning [Wu et al., 2024a]. The success of these methods affirms that representations induced by pretrained LMs carry rich semantic structure.

Interventional interpretability. Much recent work has used interventions on model-internal states to test hypotheses about how LMs implement various behaviours. In particular, interventions on linear subspaces of representations have provided increasing evidence that human-interpretable concepts are encoded linearly [Smolensky, 1986, Rumelhart et al., 1986, McClelland et al., 1986]. This includes linguistic features such as gender and number [Lasri et al., 2022, Wang et al., 2023, Hanna et al., 2023, Chintam et al., 2023, Yamakoshi et al., 2023, Hao and Linzen, 2023, Chen et al., 2023, Amini et al., 2023, Guerner et al., 2023, Arora et al., 2024], logical and mathematical reasoning [Wu et al., 2023], entity attributes [Huang et al., 2024], and a number of other domains [Mikolov et al., 2013, Elhage et al., 2022, Park et al., 2023, Nanda et al., 2023, Guerner et al., 2023].

3 ReFT

We now define the ReFT family of methods. To do this, we first summarize the core motivation, which emerges from work on intervention-based model interpretability. We then show how this leads directly to Low-rank Linear Subspace ReFT (LoReFT). Finally, we generalize this to a family of ReFT methods. Appendix A provides a brief overview of our generic ReFT training library.

To keep the presentation simple, we assume throughout that our target model is a Transformer-based [Vaswani et al., 2017] LM that produces contextualised representations of sequences of tokens. Given a sequence of n input tokens x = (x_1, ..., x_n), the model first embeds these into a list of representations h^{(0)} = (h^{(0)}_1, ..., h^{(0)}_n). Then, m layers successively compute the j-th list of hidden representations h^{(j)} as a function of the previous list of hidden representations h^{(j-1)}. Each hidden representation is a vector h ∈ R^d. The LM uses the final hidden representations h^{(m)} to produce its predictions. In our experiments, we consider both autoregressive LMs and masked LMs [Devlin et al., 2019]. An autoregressive LM predicts p(x_{n+1} | x_1, ..., x_n) = softmax(W h^{(m)}_n), while a masked LM predicts p(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = softmax(W h^{(m)}_i), where W is a learned matrix mapping from representations to logits over the vocabulary space.

3.1 Motivation

In interpretability research, the framework of causal abstraction [Geiger et al., 2021] uses interchange interventions to establish the causal role of representations in deep learning models. An interchange intervention fixes a representation to the value it would take if a counterfactual input were processed by the model. Experiments investigating how such interventions affect model behavior form the evidence for claims about the causal role of a representation and the concept it encodes.
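As a concrete illustration of a (full-representation) interchange intervention, the sketch below patches the hidden state from a counterfactual source run into a base run using a PyTorch forward hook. The model, layer index, token position, and prompts are arbitrary assumptions for illustration only; the paper's experiments use its released intervention tooling rather than raw hooks.

```python
# A minimal sketch of an interchange intervention on one residual-stream
# position, via a forward hook on a GPT-2 block (illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # assumption: a HF causal LM exposing model.transformer.h blocks
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER, POS = 6, -1    # intervene on the last token's representation after block 6

def get_hidden(text):
    """Cache the hidden state at (LAYER, POS) when processing `text`."""
    cache = {}
    def hook(module, inputs, output):
        cache["h"] = output[0][:, POS, :].detach().clone()
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def run_with_patch(text, h_source):
    """Re-run the model on `text`, fixing (LAYER, POS) to the source value."""
    def hook(module, inputs, output):
        hidden = output[0].clone()
        hidden[:, POS, :] = h_source          # the interchange intervention
        return (hidden,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(text, return_tensors="pt")).logits
    handle.remove()
    return logits

h_s = get_hidden("The capital of France is")              # counterfactual source input s
logits = run_with_patch("The capital of Italy is", h_s)   # base input b, patched
print(tok.decode(logits[0, -1].argmax().item()))
```

The distributed version defined next restricts this kind of edit to a low-rank linear subspace of the representation rather than overwriting the whole vector.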
To test whether a concept is encoded in a linear subspace of a representation, one may use a distributed interchange intervention (DII) [Geiger et al., 2023].[3] Let h_b be the hidden representation created at a given layer and token position when our model processes the base input b, and let h_s be the corresponding representation when the same model processes the source input s. A distributed interchange intervention on h_b given a counterfactual source representation h_s is then defined as

DII(h_b, h_s, R) = h_b + R^T (R h_s - R h_b)     (1)

where R ∈ R^{r×d} is a low-rank projection matrix with orthonormal rows, d is the representation dimensionality, and r is the dimensionality of the subspace we are intervening on. We learn the subspace R using distributed alignment search (DAS), which finds the subspace that maximises the probability of the expected counterfactual output after intervention [Geiger et al., 2023]. DAS is highly expressive, and can effectively localize concepts within model representations [Wu et al., 2023, Arora et al., 2024, Wu et al., 2024c, Huang et al., 2024]. This suggests that subspace representation interventions could also be a powerful tool for model control.

[3] This notion of subspace intervention was also independently discovered by Guerner et al. [2023].

Figure 2: Illustration of ReFT. (1) The left panel depicts an intervention I: the intervention function Φ is applied to hidden representations at positions P in layer l. (2) The right panel depicts the intervention function used in LoReFT, which finds an edit vector that only modifies the representation in the linear subspace spanned by the rows of R. Specifically, we show how a rank-2 LoReFT operates on 3-dimensional hidden representations.

3.2 Two low-rank ReFT instantiations

LoReFT. The formulation of DII in eq. (1) immediately suggests a way to control model generations via interventions. The guiding intuition is that we can learn how to perform interventions that steer the model towards predicting our task labels. The resulting method, Low-rank Linear Subspace ReFT (LoReFT), is defined by the following variant of eq. (1):

Φ_LoReFT(h) = h + R^T (W h + b - R h)     (2)

This is identical to eq. (1), except that we use a learned projected source R s = W h + b. LoReFT thus edits the representation in the r-dimensional subspace spanned by the rows of R to take on the values obtained from our linear projection W h + b. We depict this operation in fig. 2. The learned parameters are φ = {R, W, b}; the parameters of the LM are frozen. As with DII, R ∈ R^{r×d} is a low-rank matrix with orthonormal rows, where d is the hidden-state dimensionality and r ≤ d is the rank of the subspace. We further define a linear projection W ∈ R^{r×d} and a bias vector b ∈ R^r.

DiReFT. In addition, we define an ablation of LoReFT which removes the orthogonality constraint and the difference operation, reducing training time:

Φ_DiReFT(h) = h + W_2^T (W_1 h + b)     (3)

Both W_1, W_2 ∈ R^{r×d} are low-rank projection matrices. Note that eq. (3) resembles LoRA, and thus DiReFT can be thought of as LoRA applied directly to hidden representations at certain positions.[4] Empirical evidence from previous work suggests that adding orthogonality constraints to LoRA weights increases performance [Liu et al., 2024d]. (Appendix E reports results for additional ablations of LoReFT.)

[4] LoRA is not applicable to the residual stream, which is weightless. LoRA can be configured to apply only to the attention layer output projection matrix, which is similar to our residual stream intervention. However, previous work has found that applying LoRA only to attention layers is sub-optimal [Hu et al., 2023].
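To make eqs. (2) and (3) concrete, the following is a minimal PyTorch sketch of the two intervention functions. It illustrates the math only and is not the pyreft implementation; the module names, tensor shapes, and the use of torch's orthogonal parametrisation to keep the rows of R orthonormal are our assumptions.

```python
# Minimal sketch of the LoReFT and DiReFT intervention functions (eqs. 2-3).
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    """Phi(h) = h + R^T (W h + b - R h), with R having orthonormal rows."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.proj = nn.Linear(d, r, bias=False)          # rows of R live in proj.weight
        # Constrain R to have orthonormal rows, as in the DII formulation.
        nn.utils.parametrizations.orthogonal(self.proj, "weight")
        self.learned_source = nn.Linear(d, r)            # W h + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (..., d)
        R = self.proj.weight                              # (r, d), orthonormal rows
        rotated = h @ R.T                                  # R h
        source = self.learned_source(h)                    # W h + b
        return h + (source - rotated) @ R                  # h + R^T (W h + b - R h)

class DiReFTIntervention(nn.Module):
    """Phi(h) = h + W2^T (W1 h + b): no orthogonality, no difference term."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)                        # W1 h + b
        self.up = nn.Linear(r, d, bias=False)              # plays the role of W2^T
    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.down(h))

# Toy usage: edit a rank-4 subspace of 768-dimensional hidden states.
h = torch.randn(2, 10, 768)                                # (batch, positions, d)
print(LoReFTIntervention(768, 4)(h).shape)                 # torch.Size([2, 10, 768])
```

In a full ReFT setup, a module like this is applied to the residual-stream representations at the chosen layers and prompt positions while the LM weights stay frozen, so only {R, W, b} (or {W_1, W_2, b}) receive gradients.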
Training objective. We consider both generation tasks, using decoder-only or encoder-decoder LMs, and classification tasks, using encoder-only models with m layers. The pretrained language model induces a distribution over token sequences p(·). We denote the model that results from applying the ReFT intervention Φ to p(·) as p_Φ(·), with trainable parameters φ. To simplify notation, we refer to the hidden representations produced by the LM on input x as h(x), and to those produced by the intervened LM as h_Φ(x).

For generation tasks, our training objective is language modelling. Given an input sequence x = (x_1, ..., x_n) with n tokens as the prompt, the goal is to predict the output sequence y = (y_1, ..., y_k) with k tokens. We minimise the cross-entropy loss with teacher-forcing over all output positions:

min_φ { -Σ_{i=1}^{k} log p_Φ(y_i | x y_{<i}) }
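The sketch below shows one conventional way to compute this objective with PyTorch's cross-entropy, masking the prompt positions so that only the k output tokens contribute. The -100 ignore index and the prompt/output concatenation scheme are standard conventions assumed here, not pyreft-specific code.

```python
# Minimal sketch of the generation objective: teacher-forced cross-entropy
# over output positions only (prompt tokens are masked out of the loss).
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, prompt_len: int, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: (1, n+k, vocab) from the intervened model on the concatenation x y.
    Only positions predicting y_1..y_k contribute to the loss."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100                 # mask the prompt x: no loss on it
    shift_logits = logits[:, :-1, :]              # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```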