Published as a conference paper at ICLR 2025

# ON-THE-FLY PREFERENCE ALIGNMENT VIA PRINCIPLE-GUIDED DECODING

Mingye Zhu¹, Yi Liu², Lei Zhang¹, Junbo Guo², Zhendong Mao¹
¹University of Science and Technology of China
²State Key Laboratory of Communication Content Cognition, People's Daily Online
mingyezhu@mail.ustc.edu.cn, gavin1332@gmail.com
{leizh23, zdmao}@ustc.edu.cn, guojunbo@people.cn

## ABSTRACT

With the rapidly expanding landscape of large language models, aligning model generations with human values and preferences is becoming increasingly important. Popular alignment methods, such as Reinforcement Learning from Human Feedback, have shown significant success in guiding models with greater control. However, these methods require considerable computational resources, which is inefficient, and a substantial collection of training data to accommodate the diverse and pluralistic nature of human preferences, which is impractical. These limitations significantly constrain the scope and efficacy of both task-specific and general preference alignment methods. In this work, we introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to directly align model outputs with human preferences during inference, eliminating the need for fine-tuning. Our approach involves first curating a surrogate solution to an otherwise infeasible optimization problem and then designing a principle-guided reward function based on this surrogate. The final aligned policy is derived by maximizing this customized reward, which exploits the discrepancy between the constrained policy and its unconstrained counterpart. OPAD directly modifies the model's predictions during inference, ensuring principle adherence without incurring the computational overhead of retraining or fine-tuning.
Experiments show that OPAD achieves competitive or superior performance in both general and personalized alignment tasks, demonstrating its efficiency and effectiveness compared to state-of-the-art baselines.¹

Yi Liu is the corresponding author.
¹Code can be found at: https://github.com/stevie1023/OPAD.git.

## 1 INTRODUCTION

[Figure 1: Given the query "Tell me an example of something that would cause a financial crisis." and the principle "Please act as if you are a poet with infectious charm," OPAD offers a more poetic and eloquent response (befitting a charismatic poet), whereas prompting with the principle presents a direct answer, failing to follow the principle to act as a poet.]

As tremendous strides have been made in the development of large language models (LLMs), it remains challenging to align these models with specific principles, such as ethical guidelines or factual consistency, during generation. Popular alignment methods focus primarily on training-time optimization, such as Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Stiennon et al., 2020) and Direct Preference Optimization (DPO) (Rafailov et al., 2023). While these techniques significantly improve the alignment of model outputs, they still face certain limitations (Lin et al., 2023). RLHF, for instance, is sensitive to hyperparameters and is complex to train (Casper et al., 2023). DPO, on the other
hand, introduces a new parameterization of the RLHF objective that simplifies and stabilizes the training process, but its performance is highly dependent on the quality of the preference pairs used (Pal et al., 2024). Despite their effectiveness, these alignment methods can be inefficient, requiring substantial computational resources, and impractical, given the pluralistic nature of human preferences. User preferences vary widely across different topics (Cheng et al., 2023), making it infeasible to curate data or train multiple models to handle customized or personalized applications in a single training phase. This limitation motivates the development of inference-time algorithms that can achieve efficient, on-the-fly alignment.

[Figure 2: OPAD overview. Given user query x and principle c, OPAD computes a principle-guided reward rθ(x, y
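The abstract describes deriving the aligned policy by maximizing a reward that exploits the discrepancy between the principle-constrained policy and its unconstrained counterpart, applied directly to the model's predictions at inference time. A minimal sketch of one decoding step under that reading follows; the function name, the `beta` strength knob, and the exact exp-reweighting form are illustrative assumptions, not the paper's precise formulation:

```python
import math

def principle_guided_step(logits_with_principle, logits_plain, beta=1.0):
    """One next-token step of a simplified, OPAD-style decoding sketch.

    reward(token) = log pi(token | x, c) - log pi(token | x): the gap
    between the principle-conditioned policy and the plain policy. The
    aligned next-token distribution reweights the plain policy by
    exp(beta * reward). `beta` is an assumed strength parameter.
    """
    def log_softmax(zs):
        m = max(zs)
        lse = m + math.log(sum(math.exp(z - m) for z in zs))
        return [z - lse for z in zs]

    logp_c = log_softmax(logits_with_principle)   # log pi(. | x, c)
    logp = log_softmax(logits_plain)              # log pi(. | x)
    reward = [a - b for a, b in zip(logp_c, logp)]
    scores = [p + beta * r for p, r in zip(logp, reward)]
    return [math.exp(s) for s in log_softmax(scores)]
```

With `beta = 0` this reduces to the unconstrained policy, and larger `beta` shifts probability mass toward tokens the principle-conditioned policy prefers, without any retraining.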