# Improving LLM Safety Alignment with Dual-Objective Optimization

Xuandong Zhao*¹, Will Cai*¹, Tianneng Shi¹, David Huang¹, Licong Lin¹, Song Mei¹, Dawn Song¹

*Equal contribution. Equal senior authorship. ¹University of California, Berkeley. Correspondence to: Xuandong Zhao, Will Cai.

*Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).*

## Abstract

Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts, as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment method that disentangles the DPO objective into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks, in both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts during training and with the internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment.

## 1. Introduction

The rapid advancement of large language models (LLMs) (Schulman et al., 2022; Bai et al., 2022) has significantly amplified their societal impact, underscoring the necessity for robust safety alignment to prevent misuse (Goldstein et al., 2023; Hazell, 2023). While Direct Preference Optimization (DPO) (Rafailov et al., 2024) has emerged as a streamlined alternative to Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Bai et al., 2022) for aligning LLMs with human preferences, its effectiveness in refusal learning, a key mechanism for AI safety, remains inadequate. Notably, DPO-trained models remain vulnerable to jailbreak attacks, which exploit adversarial prompting or distributional shifts to bypass safeguards (Zou et al., 2023; Andriushchenko et al., 2024).

A closer examination of DPO's gradient dynamics reveals two systemic flaws in safety-critical scenarios. First, its loss function disproportionately suppresses harmful responses rather than actively reinforcing refusal strategies. This imbalance stems from DPO's reliance on relative preference probabilities, which prioritize maximizing the margin between safe and harmful outputs rather than ensuring an absolute increase in response safety. Second, DPO struggles to generalize beyond its training distribution, a critical vulnerability given the practically infinite attack surface of LLMs. This limitation is readily exploited by adversaries using prefix injections (Andriushchenko et al., 2024) or multi-turn dialogues (Russinovich et al., 2024) that deviate from safety training data, leading to unintended harmful outputs.

To address these shortcomings, we introduce Dual-Objective Optimization for Refusal (DOOR), a novel alignment framework that combines refusal training with harmful knowledge elimination. Our approach builds on the observation that jailbreak attacks often succeed not by circumventing refusal triggers entirely but by eliciting partial harmful generations that models fail to self-correct.
For example, if a prompt includes an initial unsafe response, the model may continue generating unsafe content, exposing a critical weakness in current alignment methods, which classify responses in atomic safe/unsafe categories (Zhang et al., 2024c). To bridge this gap, we introduce robust refusal training, which explicitly optimizes refusal likelihood with gradient descent, even when initial generations contain unsafe fragments. This is complemented by targeted unlearning using Negative Preference Optimization (NPO) (Zhang et al., 2024a), leveraging adversarially augmented data to unlearn harmful knowledge while preserving general capabilities. Our analysis indicates that DOOR not only accelerates the learning rate for safe responses but also enhances out-of-distribution (OOD) generalization, thereby mitigating the limitations inherent in DPO.

Beyond this dual-objective formulation, we further enhance safety alignment through Weighted DOOR (W-DOOR). A key innovation of W-DOOR is its token-level refinement of alignment gradients. By analyzing token distributions across jailbreak attacks, we identify critical refusal tokens (e.g., transition phrases like "However" or safety disclaimers) that serve as early indicators of alignment. We train a proxy reward model to reweight gradients at these token positions, enabling the model to preemptively recognize harmful contexts and trigger refusals. This granular, weighted optimization contrasts with traditional sequence-level alignment, which often overlooks localized safety signatures.

Our empirical evaluations demonstrate that DOOR and W-DOOR significantly enhance resilience against a variety of jailbreak techniques. Extensive testing reveals substantial reductions in attack success rates, particularly in prefilling and suffix-based adversarial settings.
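The targeted-unlearning component builds on NPO (Zhang et al., 2024a). The following is a minimal numeric sketch of the standard NPO objective computed from sequence log-probabilities; the function name and interface are illustrative, not from the paper's codebase:

```python
import math

def npo_loss(logp_harm, ref_logp_harm, beta=0.1):
    """Per-example NPO unlearning loss:
    (2/beta) * log(1 + (pi_theta(yh|x) / pi_ref(yh|x))^beta).

    Minimizing this pushes the policy's probability of the harmful
    response yh below the reference model's, with a gradient that
    fades once pi_theta(yh|x) is already much smaller than pi_ref(yh|x).
    """
    ratio_log = beta * (logp_harm - ref_logp_harm)  # beta * log r_theta(yh|x)
    return (2.0 / beta) * math.log1p(math.exp(ratio_log))
```

When the policy matches the reference (`logp_harm == ref_logp_harm`), the loss is `(2/beta) * log 2`; driving `logp_harm` down reduces it toward zero.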
Moreover, our training methodology exhibits strong generalization capabilities, maintaining robustness across both in-distribution and out-of-distribution safety scenarios. By dissecting the limitations of existing alignment techniques and introducing a more structured optimization framework, our work advances LLM safety research. Our findings underscore the importance of token-level safety refinements and provide insights into optimizing future alignment methodologies.

## 2. Background and Limitations of DPO

### 2.1. Safety Alignment Landscape

Modern safety alignment for LLMs focuses on post-training optimization to prevent harmful outputs while preserving utility. Popular approaches like Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Bai et al., 2022) align models via a two-stage process: (1) training a reward model to score responses based on safety/helpfulness, and (2) fine-tuning the LLM using Proximal Policy Optimization (PPO) (Schulman et al., 2017). While effective, RLHF's complexity can pose scalability and stability challenges (Choshen et al., 2019), especially in resource-constrained settings.

Direct Preference Optimization (DPO) (Rafailov et al., 2024) emerges as a more streamlined alternative by directly optimizing over human preferences. DPO reparameterizes the reward function using policy probabilities, eliminating the need for explicit reward modeling and simplifying alignment to a direct optimization task. Given a safety tuning dataset $\mathcal{D} = \{(x, y_s, y_h)\}$ comprising prompts $x$, safe responses $y_s$, and harmful responses $y_h$, DPO minimizes the following loss:

$$
\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_s, y_h) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_s \mid x)}{\pi_{\mathrm{ref}}(y_s \mid x)} - \beta \log \frac{\pi_\theta(y_h \mid x)}{\pi_{\mathrm{ref}}(y_h \mid x)} \right) \right],
$$

where $\pi_\theta$ is the learnable policy, $\pi_{\mathrm{ref}}$ is a reference model (typically the supervised fine-tuned (SFT) model), and $\beta$ controls sensitivity to preference gaps. By avoiding explicit reward modeling and on-policy rollouts, DPO significantly reduces training overhead compared to RLHF.
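Given summed sequence log-probabilities under the policy and the frozen reference model, the DPO loss above can be computed per example as in this minimal pure-Python sketch (the function and argument names are illustrative, not from the paper's codebase):

```python
import math

def dpo_loss(logp_safe, logp_harm, ref_logp_safe, ref_logp_harm, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    Each argument is log pi(y|x): the summed token log-probs of a
    response under the trainable policy (logp_*) or the frozen
    reference model (ref_logp_*).
    """
    # preference margin: beta * [log r_theta(ys|x) - log r_theta(yh|x)]
    margin = beta * ((logp_safe - ref_logp_safe) - (logp_harm - ref_logp_harm))
    # -log sigmoid(margin) = softplus(-margin), computed stably
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy has not yet moved from the reference, the margin is zero and the loss is `log 2`; increasing the margin (favoring the safe response) drives the loss toward zero.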
These advantages have led to its rapid adoption in large-scale alignment pipelines (e.g., Llama-3 (Team, 2024b)).

### 2.2. Limitations of DPO in Safety Contexts

While DPO simplifies alignment, recent work (Xu et al., 2024b; Feng et al., 2024) identifies critical limitations in safety-related scenarios. These include biases towards unseen responses and imbalanced refusal reinforcement. To further understand these shortcomings, we analyze the gradient dynamics of the DPO loss. Treating LLM generation as next-token prediction, where $\pi_\theta(y \mid x) = \mathrm{softmax}(s_\theta(x))_y$ and $s_\theta(x) \in \mathbb{R}^{|\mathcal{V}|}$ is the logit vector, and defining $r_\theta(y \mid x) = \pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)$, the gradient of the DPO loss with respect to the model parameters $\theta$ for a sample $(x, y_s, y_h)$ can be decomposed as:

$$
\begin{aligned}
-\frac{1}{\beta} \nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\theta; x, y_s, y_h)
&= \frac{r_\theta^\beta(y_h \mid x)\,\bigl[\nabla_\theta \log r_\theta(y_s \mid x) - \nabla_\theta \log r_\theta(y_h \mid x)\bigr]}{r_\theta^\beta(y_s \mid x) + r_\theta^\beta(y_h \mid x)} \\
&= \frac{r_\theta^\beta(y_h \mid x)\,\bigl[\nabla_\theta s_{\theta, y_s}(x) - \nabla_\theta s_{\theta, y_h}(x)\bigr]}{r_\theta^\beta(y_s \mid x) + r_\theta^\beta(y_h \mid x)} \\
&= \underbrace{\frac{r_\theta^\beta(y_h \mid x)}{r_\theta^\beta(y_s \mid x) + r_\theta^\beta(y_h \mid x)} \nabla_\theta s_{\theta, y_s}(x)}_{\text{increase logit of } y_s}
\;-\; \underbrace{\frac{r_\theta^\beta(y_h \mid x)}{r_\theta^\beta(y_s \mid x) + r_\theta^\beta(y_h \mid x)} \nabla_\theta s_{\theta, y_h}(x)}_{\text{decrease logit of } y_h}.
\end{aligned}
$$

This gradient decomposition reveals two key limitations of DPO in safety alignment:

1. **Imbalance in learning rate.** In DPO, the effective learning rate $r_\theta^\beta(y_h \mid x) / [r_\theta^\beta(y_s \mid x) + r_\theta^\beta(y_h \mid x)] \approx e^{-\beta C}$ once the difference in the change of the logits $[s_{\theta, y_s}(x) - s_{\mathrm{ref}, y_s}(x)] - [s_{\theta, y_h}(x) - s_{\mathrm{ref}, y_h}(x)] \geq C$. This means DPO essentially stops increasing the logit of the preferred response $s_{\theta, y_s}(x)$ as soon as it has increased slightly more than $s_{\theta, y_h}(x)$. Consequently, it becomes difficult for $\pi_\theta(y_s \mid x)$ to increase further under DPO, which is problematic in safety settings where we want the model to consistently generate safe or refusal responses to harmful queries.

2. **OOD generalization concerns.** DPO does not explicitly penalize OOD responses.
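The learning-rate saturation in limitation 1 can be checked numerically: the gradient weight $r_\theta^\beta(y_h \mid x) / [r_\theta^\beta(y_s \mid x) + r_\theta^\beta(y_h \mid x)]$ equals $\sigma(-\beta\Delta)$, where $\Delta$ is the logit-change difference, and decays like $e^{-\beta\Delta}$. A small sketch (function name is illustrative):

```python
import math

def effective_lr(delta, beta=0.1):
    """DPO's gradient weight r^beta(yh|x) / [r^beta(ys|x) + r^beta(yh|x)].

    `delta` is the logit-change difference
    [s_theta,ys - s_ref,ys] - [s_theta,yh - s_ref,yh]; the weight equals
    sigmoid(-beta * delta), which decays like e^{-beta * delta}.
    """
    return 1.0 / (1.0 + math.exp(beta * delta))
```

With `beta = 0.1`, the weight starts at 0.5 when the logits have moved equally and drops below 1e-4 once the safe logit leads by 100, so further updates barely increase $\pi_\theta(y_s \mid x)$.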
It is possible that the gradient terms $\nabla_\theta \pi_\theta(y_s \mid x)$ and $\nabla_\theta \pi_\theta(y_h \mid x)$ are positively correlated with those of OOD data, i.e., $\langle \nabla_\theta \pi_\theta(y_s \mid x), \nabla_\theta \pi_\theta(y_o \mid x) \rangle > 0$ and $\langle \nabla_\theta \pi_\theta(y_h \mid x), \nabla_\theta \pi_\theta(y_o \mid x) \rangle > 0$. This correlation could inadvertently increase the logits of OOD responses, consequently reducing $\pi_\theta(y_s \mid x)$. In safety settings, the practically infinite attack surface induced by a text interface necessitates robust generalization of safe behavior from a relatively small (often predominantly English) safety tuning dataset to prevent a wide range of failure cases.

## 3. Dual-Objective Safety Alignment

Building on the limitations of DPO, we propose that robust jailbreak resilience requires two complementary objectives:

- **Robust refusal training:** encourage the model to refuse or abort unsafe content generation, even if it has partially produced harmful tokens.
- **Targeted unlearning:** actively penalize or unlearn harmful knowledge pathways so that the model's probability of generating unsafe content decreases.

Below, we detail how these objectives can be implemented and how they address the core limitations of DPO in safety settings.

### 3.1. Robust Refusal Training

We first address partial unsafe generations via data augmentation. This approach also targets a common jailbreak tactic, the prefilling attack (Andriushchenko et al., 2024), which prompts the model to generate partially harmful content, hoping it will continue down this path. We augment each harmful prompt $x$ by appending a prefix of a known harmful response $y_h$ to the original prompt $x$. Formally, if $\{(x, y_s, y_h)\}$ is our safety dataset, we create augmented inputs $\tilde{x} = x \oplus y_h^{<k}$, where $y_h^{<k}$ denotes the first $k$ tokens of the harmful response. [...] $> 0$ is a temperature parameter. In practice, $\pi^*$ can be approximated by a policy trained under strong preference data (e.g., a DPO-aligned model). Intuitively, if a token is assigned a disproportionately higher probability by $\pi^*$ than by $\pi_{\mathrm{ref}}$, it plays a critical role in ensuring a refusal.
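The prefix-augmentation step of robust refusal training can be sketched as follows; the helper name and the token-list representation are illustrative, not from the paper:

```python
def augment_with_harmful_prefix(x, y_harm, k):
    """Build a prefilling-style training input: the prompt followed by
    the first k tokens of a known harmful response.

    The model is then trained to produce a refusal continuation even
    though part of a harmful response is already present in context.
    Token sequences are represented as lists of strings for illustration.
    """
    return x + y_harm[:k]

prompt = ["How", "do", "I", "pick", "a", "lock", "?"]
harmful = ["Step", "1", ":", "insert", "a", "tension", "wrench"]
x_aug = augment_with_harmful_prefix(prompt, harmful, k=3)
# training target: a refusal continuation, e.g. starting with "However, ..."
```

Training on such inputs teaches the model to abort and refuse mid-generation, directly countering prefilling attacks.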
Combining the token-level weighted version of robust refusal training with our unlearning objective yields the following loss function βt log πθ(ys t | x, ys
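Since the token-level weighting formula is truncated in this excerpt, the following is only a plausible sketch of weights derived from the $\pi^*$ / $\pi_{\mathrm{ref}}$ probability ratio with a temperature $\tau$; the function name and the softmax normalization are assumptions, not the paper's exact definition:

```python
import math

def refusal_token_weights(logp_star, logp_ref, tau=1.0):
    """Illustrative per-token weights from log pi*(y_t) - log pi_ref(y_t).

    Tokens that the strong aligned policy pi* prefers much more than the
    reference pi_ref (e.g., refusal transitions like "However") receive
    larger weight; tau > 0 sharpens (small tau) or flattens (large tau)
    the distribution. Normalized with a numerically stable softmax.
    """
    scores = [(a - b) / tau for a, b in zip(logp_star, logp_ref)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

A token whose probability is boosted most by the aligned policy relative to the reference ends up dominating the weight vector, which is the behavior the intuition above calls for.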