# Constrain Alignment with Sparse Autoencoders

Qingyu Yin 1 2 *, Chak Tou Leong 3 *, Hongbo Zhang 2, Minjun Zhu 2, Hanqi Yan 4, Qiang Zhang 1, Yulan He 4, Wenjie Li 3, Jun Wang 5, Yue Zhang 2, Linyi Yang 6

*Equal contribution. 1 Zhejiang University, 2 Westlake University, 3 Hong Kong Polytechnic University, 4 King's College London, 5 University College London, 6 Southern University of Science and Technology. Correspondence to: Linyi Yang and Yue Zhang. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract: The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often suffer from computational inefficiency and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using the sparse features activated in a well-trained sparse autoencoder, and matches the quality of sequential KL divergence by using a feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves an absolute improvement of more than 5% in win rate at much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment. Code is available at Feature Alignment.

1. Introduction

Aligning large language models (LLMs) with human values and practical objectives is a critical challenge in AI development (Wang et al., 2023). Post-training methods, including fine-tuning (Wei et al., 2022; Chung et al., 2024) and alignment strategies (Tunstall et al., 2023), have played a significant role in refining LLM behavior. Among these, Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022) has emerged as a leading technique, integrating human feedback to guide models towards producing valuable and useful outputs. Despite its success, RLHF involves complex mechanisms such as reward modeling and policy gradients, which introduce significant training complexity and computational cost (Zheng et al., 2023b; Rafailov et al., 2024). To address these limitations, Direct Preference Optimization (DPO) (Rafailov et al., 2024) has been proposed as a more efficient alternative. Unlike reward-based methods such as Proximal Policy Optimization (PPO) (Schulman et al., 2017), DPO directly adjusts the model's output probabilities based on human preferences, reducing training complexity and computational cost. Because DPO uses a reference model to stabilize post-training and bypasses the challenges associated with reward models and policy updates, DPO-like approaches offer a more stable and faster alignment process, making them a compelling solution for efficient LLM alignment. Recent advancements in DPO focus mainly on two directions: efficiency, i.e., further simplifying the constraints of DPO, and controllability, i.e., keeping the balance between alignment and generation diversity. In terms of simplicity, methods like SimPO (Meng et al., 2024) and Odds Ratio Preference Optimization (ORPO) (Hong et al., 2024) eliminate the need for a reference model by using the average log probability of sequences as an implicit normalizer, thereby reducing memory usage and computational demands.
However, DPO's performance is sensitive to the strength of constraints from the reference policy (Liu et al., 2024), and these reference-free alignment approaches (Hong et al., 2024; Meng et al., 2024) can compromise control, resulting in unstable training. In terms of controllability, Token-level Direct Preference Optimization (TDPO) (Zeng et al., 2024) introduces token-level rewards and sequential Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) to tackle issues related to linguistic coherence, diversity, and stability. However, this comes at the cost of increased computational complexity: TDPO introduces an additional sequential KL term and depends on reference models, complicating the loss computation. A natural hypothesis arises: Is there a method that can strike the right balance between efficiency and controllability? In response, we propose FPO, Feature-level constrained Preference Optimization (see Figure 1), an efficient and controllable method for constraining the model at the feature level. Here, a feature refers to a salient piece of information used in the model's decision (Huben et al., 2024). Intuitively, adjusting the model using feature-level preferences allows fine-grained adjustment that minimizes side effects, by avoiding the negative influence of spurious features present in coarse-grained control such as token-level regularization (Zeng et al., 2024).

Figure 1: Left. The DPO objective loss function and its two main improvement directions: SimPO and TDPO. SimPO focuses on simplifying the reference model, while TDPO concentrates on controlling the alignment process to enhance generation diversity. Right. The pipeline of FPO, consisting of sparse autoencoders and the feature-level MSE constraints (Phase 1: offline caching; Phase 2: FPO training).
To achieve this, we derive the FPO objective by contrasting SimPO and DPO, identifying the constraint term that SimPO misses. We then restore this term by introducing feature-level constraints as an alternative to the costly sequential KL (Zeng et al., 2024). We use Sparse Autoencoders (SAEs) (Huben et al., 2024), which generate representations where only a few features are active, enhancing computational efficiency (see Figure 2 Right). Furthermore, regularization in the coefficient space promotes sparsity, stability, and uniqueness in the model's representations. Since SAEs produce sparse representations, only a few dozen out of 16,000 features are active at any given time (Lieberum et al., 2024). Compared to SimPO, FPO is equally efficient in memory and time complexity, yet offers improved controllability due to its feature-level constraints; compared to constraint-based methods like TDPO, FPO matches the computational and memory efficiency of methods such as SimPO, with potentially improved performance, as feature-level control can generalize better than token-level control. A contrast between FPO, DPO, SimPO, and TDPO is shown in Figure 1. Our experiments demonstrate that FPO consistently outperforms state-of-the-art methods across different sizes of backbone LLMs, achieving up to 5% absolute improvement in win rate (see Table 2) on the AlpacaEval-2 and Arena-Hard benchmarks, up to 0.5 points on MT-Bench, and competitive output diversity. By constraining the shifts of these features during training, we achieve results that meet or even exceed the effectiveness of sequential KL at a significantly lower computational cost (a 17.6% reduction compared to TDPO2, as shown in Figure 4 Left). Additionally, we present detailed ablation studies showing that our method maintains stable performance across different temperatures and choices of SAE layer.
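To make the feature-level constraint concrete, the following is a minimal sketch (not the paper's implementation): hidden states are encoded by a pre-trained SAE into mostly-zero feature activations, and an MSE penalty is applied between the policy's features and reference features cached offline. The function names `sae_encode` and `feature_level_constraint`, and all tensor shapes, are illustrative assumptions.

```python
import torch

def sae_encode(hidden, W_enc, b_enc):
    # ReLU-style SAE encoder: most dictionary features stay exactly zero.
    return torch.relu(hidden @ W_enc + b_enc)

def feature_level_constraint(policy_hidden, ref_features, W_enc, b_enc):
    """MSE between the policy's sparse SAE features and offline-cached
    reference features (hypothetical signature, for illustration only)."""
    policy_features = sae_encode(policy_hidden, W_enc, b_enc)
    return torch.mean((policy_features - ref_features) ** 2)

# Toy sizes: hidden dim 8, dictionary of 32 features (real SAEs use ~16k).
torch.manual_seed(0)
W_enc = torch.randn(8, 32)
b_enc = -torch.ones(32)           # negative bias encourages sparse activations
h_policy = torch.randn(4, 8)      # hidden states at 4 token positions
ref_feats = sae_encode(torch.randn(4, 8), W_enc, b_enc)  # cached offline
loss = feature_level_constraint(h_policy, ref_feats, W_enc, b_enc)
```

Because only the few active features carry signal, the constraint can be computed on a sparse vector rather than the full vocabulary distribution, which is where the efficiency gain over sequential KL comes from.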
Overall, we show that FPO enjoys the efficiency of SimPO by using offline reference control, while retaining the constraint quality of sequential KL through sparse feature-level constraints. To our knowledge, this is the first approach that integrates sparse feature-level constraints into LLM alignment. By incorporating sparse autoencoders with token-level DPO, FPO makes practically meaningful and theoretically solid improvements over existing preference optimization methods along three dimensions: simplicity of implementation, efficiency, and generation diversity.

2. Preliminary

Direct Preference Optimization (DPO). DPO, derived from Reinforcement Learning from Human Feedback (RLHF), provides a direct way to align large language models (LLMs) with human preferences without explicitly using a reward model. In practice, an LLM is prompted with a sequence x (e.g., a question) to generate a corresponding sequence y (e.g., an answer), where both x and y consist of tokens. DPO maps the reward function r(x, y) to the optimal policy by minimizing the reverse KL divergence from a reference model. This results in the following equation for the reward:

$$ r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x), \tag{1} $$

where $\pi_\theta(\cdot \mid x)$ and $\pi_{\mathrm{ref}}(\cdot \mid x)$ are the policy (i.e., the LLM being post-trained) and reference (i.e., the base LLM) models, respectively, $\beta$ is the coefficient that governs the strength of the KL divergence penalty, and $Z(x)$ is the partition function. To align with human preferences, DPO uses the Bradley-Terry (BT) model for pairwise comparisons. By incorporating the reward function into the BT model and using the negative log-likelihood, DPO computes the loss:

$$ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( u(x, y_w, y_l) \right) \right], $$
$$ u(x, y_w, y_l) = \beta \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right). $$

Here, $\mathcal{D}$ represents the dataset of human preference pairs, and $y_w$ and $y_l$ are the preferred and less preferred completions, respectively. DPO provides a direct way to align LLMs with human preferences without the explicit use of a reward model, leveraging preference comparisons.

Figure 2: Left. Top-50 SAE feature activation value distribution in Gemma-2-2b. We rank the activated features by their activation values: the vertical axis represents the activation values, while the horizontal axis shows the rank of the maximum activation values. This plot illustrates the sparsity of the SAE: out of 16,000 features, fewer than 50 have significant activation values. Right. Comparison of existing alignment methods on (1) whether they need to load a reference model when training the policy model, (2) memory consumption, and (3) their ability to control generation diversity:

| Method | Reference | Efficiency | Constraint |
|---|---|---|---|
| SFT | Free | High | Weak |
| DPO | Offline | High | Weak |
| SimPO | Free | High | Weak |
| TDPO | Needed | Low | Strong / Dense |
| FPO (Ours) | Offline | High | Strong / Sparse |

Simple Preference Optimization (SimPO). SimPO simplifies DPO by removing the need for a reference model, aligning rewards directly with the length-normalized log-likelihood of the policy model's output. The SimPO loss function can be formulated as:

$$ \mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( u(x, y_w, y_l) \right) \right], $$
$$ u(x, y_w, y_l) = \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma, $$

where $\gamma$ is a positive margin ensuring that the reward for the preferred response exceeds that of the less preferred one by at least $\gamma$. However, while SimPO is computationally efficient, the lack of reference control (Roy et al., 2021) results in instability, as the reference model can stabilize training and improve performance (Liu et al., 2024).

Token-Level Direct Preference Optimization (TDPO). TDPO refines the DPO framework by operating at the token level, accounting for the sequential nature of text generation.
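The pairwise objectives above can be sketched directly from per-sequence log-probabilities. This is a toy illustration: the tensors, the values of `beta` and `gamma`, and the sequence lengths are all made up for the example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # u = beta * ((log pi_theta(yw) - log pi_ref(yw)) - (log pi_theta(yl) - log pi_ref(yl)))
    u = beta * ((pol_logp_w - ref_logp_w) - (pol_logp_l - ref_logp_l))
    return -F.logsigmoid(u).mean()

def simpo_loss(pol_logp_w, pol_logp_l, len_w, len_l, beta=2.0, gamma=1.0):
    # Length-normalized log-likelihood replaces the reference model,
    # with a margin gamma between preferred and less preferred responses.
    u = beta * (pol_logp_w / len_w - pol_logp_l / len_l) - gamma
    return -F.logsigmoid(u).mean()

# Toy summed log-probabilities for a batch of 2 preference pairs.
pol_w = torch.tensor([-10.0, -12.0]); pol_l = torch.tensor([-15.0, -14.0])
ref_w = torch.tensor([-11.0, -12.5]); ref_l = torch.tensor([-14.0, -13.0])
len_w = torch.tensor([20.0, 24.0]);   len_l = torch.tensor([25.0, 22.0])
l_dpo = dpo_loss(pol_w, pol_l, ref_w, ref_l)
l_simpo = simpo_loss(pol_w, pol_l, len_w, len_l)
```

The structural difference is visible in the signatures: `dpo_loss` needs reference log-probabilities, while `simpo_loss` only needs the policy's, which is exactly the memory saving (and the loss of reference control) discussed above.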
The first version of the TDPO loss function is given by:

$$ \mathcal{L}_{\mathrm{TDPO1}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( u(x, y_w, y_l) \right) \right], $$
$$ u(x, y_w, y_l) = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \delta_{\mathrm{TDPO1}}(x, y_w, y_l), \tag{2} $$

where $\delta_{\mathrm{TDPO1}}(x, y_w, y_l)$ is the difference in sequential KL divergence between the preferred and less preferred completions:

$$ \delta_{\mathrm{TDPO1}}(x, y_w, y_l) = \beta \left( D_{\mathrm{SeqKL}}(x, y_l; \pi_{\mathrm{ref}} \,\|\, \pi_\theta) - D_{\mathrm{SeqKL}}(x, y_w; \pi_{\mathrm{ref}} \,\|\, \pi_\theta) \right), \tag{3} $$

and the sequential KL divergence between the reference and policy outputs for a sequence of length $T$ is defined as

$$ D_{\mathrm{SeqKL}}(x, y; \pi_{\mathrm{ref}} \,\|\, \pi_\theta) = \sum_{t=1}^{T} D_{\mathrm{KL}}\left( \pi_{\mathrm{ref}}(\cdot \mid [x, y^{<t}]) \,\|\, \pi_\theta(\cdot \mid [x, y^{<t}]) \right). $$
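The sequential KL term above can be sketched as a per-position sum of full-vocabulary KL divergences. This is a toy sketch: the function name `sequential_kl`, the tensor shapes, and the vocabulary size are illustrative assumptions, but it shows why this constraint is costly, since every token position requires a KL over the entire vocabulary.

```python
import torch
import torch.nn.functional as F

def sequential_kl(ref_logits, pol_logits):
    """Sum over positions t of KL(pi_ref(.|[x, y<t]) || pi_theta(.|[x, y<t])).
    ref_logits, pol_logits: (T, vocab) next-token logits at each position."""
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    pol_logp = F.log_softmax(pol_logits, dim=-1)
    # KL(ref || pol) at each position, then summed over the sequence.
    per_token = (ref_logp.exp() * (ref_logp - pol_logp)).sum(dim=-1)
    return per_token.sum()

torch.manual_seed(0)
T, vocab = 6, 50                    # toy sequence length and vocabulary size
ref = torch.randn(T, vocab)
pol = torch.randn(T, vocab)
d_same = sequential_kl(ref, ref)    # identical distributions give zero
d_diff = sequential_kl(ref, pol)    # differing distributions give a positive value
```

Each term touches all `vocab` entries at every position, whereas FPO's feature-level constraint only needs the handful of active SAE features, which is the efficiency argument made in the introduction.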