Published as a conference paper at ICLR 2025

PAD: PERSONALIZED ALIGNMENT OF LLMS AT DECODING-TIME

Ruizhe Chen (1), Xiaotian Zhang (1), Meng Luo (2), Wenhao Chai (3), Zuozhu Liu (1)
(1) Zhejiang University, (2) National University of Singapore, (3) University of Washington

ABSTRACT

Aligning with personalized preferences, which vary significantly across cultural, educational, and political differences, poses a significant challenge due to the computational costs and data demands of traditional alignment methods. In response, this paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase, eliminating the need for additional training. By introducing a unique personalized reward modeling strategy, this framework decouples the text generation process from personalized preferences, facilitating the generation of generalizable token-level personalized rewards. The PAD algorithm leverages these rewards to guide the decoding process, dynamically tailoring the base model's predictions to personalized preferences. Extensive experimental results demonstrate that PAD not only outperforms existing training-based alignment methods in aligning with diverse preferences, but also generalizes to preferences unseen during training and scales across different base models. This work advances the capability of LLMs to meet user needs in real-time applications, presenting a substantial step forward in personalized LLM alignment. Our model and code are available here.

1 INTRODUCTION

Recent advancements have demonstrated success in aligning language models with human preferences and values (Stiennon et al., 2020; Bai et al., 2022; Ouyang et al., 2022; Achiam et al., 2023).
Representative methods such as Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) and DPO (Rafailov et al., 2024b) typically optimize a policy model with training signals from an explicit or implicit reward model, which captures general human preferences or values from human feedback. However, in this pluralistic world, users' preferences can diverge significantly based on their cultures, educational backgrounds, religions, and political stances (Gordon et al., 2022; Sorensen et al., 2024a; Jang et al., 2023; Cheng et al., 2023). Furthermore, even for the same person, the preference over a particular LLM response can vary as the application scenario changes. Hence, there always exists a portion of human preferences that cannot be unified under the general preference, known as personalized preferences, which current alignment frameworks struggle to align with due to the need for high-quality datasets and the substantial computational costs of policy optimization.

How can we align with personalized preferences without additional data collection and policy training? In this paper, we introduce Personalized Alignment at Decoding-time (PAD), which aims to align LLM outputs with diverse personalized preferences during the inference phase without requiring additional training. To achieve this, we first propose a personalized reward modeling strategy that decouples the text generation process (modeled as a Markov Decision Process) from personalized preferences, thereby enabling the acquisition of generalizable token-level personalized rewards. Based on this, we then formulate a personalized reward model (PersRM). During decoding, the PersRM scores the base model's top-K predictions at each token generation step, based on the current generation and the personalized preference. Finally, this score is combined with the standard decoding likelihood to adjust the base model's predictions.
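A minimal sketch of this rescoring step (illustrative only, not the paper's implementation; `base_logprobs` and `persrm_scores` are hypothetical stand-ins for the base model's next-token log-probabilities and the PersRM rewards over candidate tokens):

```python
def pad_decode_step(base_logprobs, persrm_scores, beta=1.0, top_k=3):
    """Pick the next token by rescoring the base model's top-K candidates:
    decoding likelihood + beta * personalized reward."""
    # 1) Take the base model's top-K candidate tokens.
    candidates = sorted(base_logprobs, key=base_logprobs.get, reverse=True)[:top_k]
    # 2) Combine the standard decoding likelihood with the personalized score.
    return max(candidates,
               key=lambda t: base_logprobs[t] + beta * persrm_scores.get(t, 0.0))

# Toy example (scores invented): the personalized reward steers the choice
# away from the likelihood-only argmax.
base = {"layer": -0.4, "topmost": -0.9, "the": -1.5}
reward = {"topmost": 0.8}
next_token = pad_decode_step(base, reward)  # "topmost" outscores "layer"
```

Section 3.2 derives the concrete score used by PAD as a token-level advantage from the PersRM; the weighted-selection rule itself is the same.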
The advantages of PAD are as follows: (1) It requires only a single policy model (i.e., the base model) aligned with general preferences (General Policy), eliminating the need to train additional policy models (Training-free). (2) It utilizes only a single reward model (Single Reward). (3) It does not require pre-defined personalized preferences and can generalize to preferences not seen during the training phase (Generalizability). A checklist of PAD's advantages over previous methods is presented in Table 1.

(Equal Contribution; Corresponding Author.)

Table 1: A checklist of key characteristics of previous methods and PAD ("-": Not Applicable). The compared methods are MORLHF (Li et al., 2020), MODPO (Zhou et al., 2023), Personalized Soups (Jang et al., 2023), Preference Prompting (Jang et al., 2023), Rewarded Soups (Rame et al., 2024), RiC (Yang et al., 2024b), DPA (Wang et al., 2024a), ARGS (Khanov et al., 2024), MOD (Shi et al., 2024), and MetaAligner (Yang et al., 2024a), evaluated on the four criteria Training-free, General Policy, Single Reward, and Generalizability.

Our contributions can be summarized as follows:
- We propose a novel personalized reward modeling strategy that decouples the dynamics of text generation from personalized preferences. This strategy enables the acquisition of generalizable token-level personalized rewards with a single personalized reward model (PersRM).
- We propose a novel personalized alignment at decoding-time (PAD) algorithm that performs guided decoding under the guidance of token-level personalized rewards, without requiring the training of additional policy models.
- Extensive experiments demonstrate that PAD outperforms existing training-based methods in aligning with diverse personalized preferences. Furthermore, the results highlight PAD's effectiveness in generalizing to unseen preferences and its model-agnostic scalability.

2 RELATED WORKS

Large language model alignment. Large language model alignment aims to align LLMs with human preferences.
A common approach uses the RLHF (Reinforcement Learning from Human Feedback) framework (Christiano et al., 2017; Bai et al., 2022), where a reward model is trained on human feedback and Proximal Policy Optimization (PPO) (Schulman et al., 2017) is employed to derive the aligned policy model. Notably, decoding-time alignment offers an alignment paradigm that does not require expensive RL training (Mudgal et al., 2023; Khanov et al., 2024; Han et al., 2024; Liu et al., 2024; Huang et al., 2024). Controlled Decoding (CD) (Mudgal et al., 2023) utilizes a prefix-scorer module trained to assess value functions for rewards, allowing controlled generation from a frozen base model. ARGS (Khanov et al., 2024) proposes using a reward signal to adjust probabilistic predictions, thereby generating semantically aligned texts. DeAL (Huang et al., 2024) focuses on heuristic-guided search to better meet diverse alignment objectives.

Personalized alignment. As users exhibit diverse preferences and values for a single task, it is essential to align LLMs with personalized preferences (Kirk et al., 2023; Sorensen et al., 2023; 2024b; Yao et al., 2023; Kirk et al., 2024; Zhong et al., 2024; Han et al., 2024). One line of work achieves joint optimization for different personalized preferences by defining a multi-dimensional reward function and performing policy optimization (Zhou et al., 2023; Wang et al., 2024a;b; Guo et al., 2024; Yang et al., 2024b; Chakraborty et al., 2024; Sun et al., 2024; Li et al., 2024). Other approaches merge model parameters or predictions for each dimension to handle their diverse combinations (Jang et al., 2023; Rame et al., 2024; Park et al., 2024; Shi et al., 2024). Lastly, prompt-based methods align with personalized preferences by designing diverse prompts or post-processing techniques (Yang et al., 2024a; Lee et al., 2024; Hwang et al., 2023; Jafari et al., 2024).
3.1 PRELIMINARIES

In this section, we first define the per-token Markov Decision Process (MDP) for large language models (LLMs) and describe its relationship to classic Reinforcement Learning from Human Feedback (RLHF). We then describe the characteristics and challenges of personalized alignment.

Text generation as a token-level Markov Decision Process. The standard text generation process of an LLM with prompt $x$ and response $y$ can be defined as a token-level MDP, denoted by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, T)$. The state space $\mathcal{S}$ consists of the prompt and all tokens generated so far, i.e., $s_t = (x, y_{1:t-1})$. The action space $\mathcal{A}$ is the token vocabulary, i.e., $a_t = y_t$. $P$ is the transition kernel, which is deterministic: given state $s_t = (x, y_{1:t-1})$ and action $a_t = y_t$, the next state is $s_{t+1} = (s_t, a_t) = (x, y_{1:t})$. $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ represents the reward at each step. The maximum token count $T$ sets the length limit for LLM outputs, which conclude with an end-of-sentence (EOS) token $y_T = \text{EOS}$ that ends the generation. Given an MDP, the objective is to maximize the expected return $R(x, y) = \sum_{t=1}^{T} R(s_t, a_t)$. To achieve this, the agent computes a (Markov) policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ that maps each state to a distribution over actions.

The RLHF pipeline. Classic RLHF approaches (Ziegler et al., 2019) first learn a reward function from human feedback on prompt-response pairs $(x, y_w, y_l)$. The reward function is modeled in a contextual bandit setting using the Bradley-Terry preference model (Bradley & Terry, 1952):

$$p^*(y_w \succ y_l) = \frac{\exp R(x, y_w)}{\exp R(x, y_w) + \exp R(x, y_l)}, \qquad (1)$$

where $y_w$ and $y_l$ denote the preferred and dispreferred completions for the prompt $x$, and $p^*(y_w \succ y_l)$ denotes the probability that $y_w$ is preferred over $y_l$. The reward model can be learned through Maximum Likelihood Estimation (MLE) on the dataset $\mathcal{D}$:

$$\mathcal{L}(R, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(R(x, y_w) - R(x, y_l)\big)\right]. \qquad (2)$$
Subsequently, we use the learned reward model to provide feedback. The policy model (i.e., the language model) $\pi_\theta$ is optimized with a gradient-based method such as PPO (Schulman et al., 2017), with an entropy bonus, using the following KL-constrained RL objective:

$$\max_{\pi_\theta} \; \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\big[R(x, y)\big] - \beta D_{\mathrm{KL}}\big(\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big), \qquad (3)$$

where $\pi_{\mathrm{ref}}$ represents a reference policy, typically the language model resulting from supervised fine-tuning, from which the learned policy should not deviate significantly, and $\beta$ is a parameter that controls this deviation. In practice, the language model policy $\pi_\theta$ is initialized to $\pi_{\mathrm{ref}}$. Note that we exclude the supervised fine-tuning (SFT) stage from the RLHF pipeline, as it is not directly relevant to the focus of this paper.

Personalized alignment within the MDP. Unlike the one-dimensional reward in traditional MDPs, we posit that personalized preferences may vary across different dimensions; for example, some users may prefer concise and understandable responses, while others favor comprehensive and expert answers. The reward function of personalized alignment can therefore be defined as $\hat{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$, a vector of $n$ rewards, one for each dimension of personalized preference (e.g., concise/comprehensive, expert/elementary), instead of a scalar. During alignment, a personalized preference may encompass one or several reward dimensions. Based on this, existing work (Li et al., 2020; Rame et al., 2024) employs a linear scalarization strategy, denoting human preferences by a weight vector $w$ such that $R = w^\top \hat{R}$, and then optimizes the policy with the RLHF objective or performs a weighted merging of multiple policies. However, these approaches still cannot avoid the high time and computational costs of optimizing the policy (i.e., the language model) for multiple personalized preferences simultaneously.
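To make Eqs. (1)-(2) concrete, here is a minimal Bradley-Terry sketch in plain Python with scalar toy rewards (in the paper, $R$ is a learned model, not fixed numbers):

```python
import math

def bt_prob(r_w, r_l):
    # Eq. (1): P(y_w preferred over y_l) from the two rewards.
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

def bt_loss(pairs):
    # Eq. (2): MLE objective, the average negative log-sigmoid of the
    # reward margin; log(sigmoid(r_w - r_l)) equals log P(y_w preferred).
    return -sum(math.log(bt_prob(rw, rl)) for rw, rl in pairs) / len(pairs)

# Equal rewards give a 50/50 preference; a larger margin lowers the loss.
p_tie = bt_prob(1.0, 1.0)            # 0.5
loss_small = bt_loss([(0.5, 0.0)])
loss_large = bt_loss([(2.0, 0.0)])   # smaller than loss_small
```

Minimizing `bt_loss` over the preference dataset $\mathcal{D}$ is exactly the reward-model training step; Section 3.2 reuses this objective with token-level implicit rewards (Eq. 9).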
3.2 PERSONALIZED ALIGNMENT AT DECODING-TIME

In this section, we propose Personalized Alignment at Decoding-time (PAD), a novel approach that does not require training a policy and is transferable to diverse personalized preferences. Inspired by the concept of successor features (Dayan, 1993; Barreto et al., 2017), we first define the personalized reward function, which is composed of the features of the current state and the personalized preference. By linking this reward function to the value function, we can decouple personalized preferences from the dynamics of the MDP. Consequently, we only need to learn the generic features under the current policy, and by merely altering the personalized preference we can achieve personalized alignment. Finally, we introduce a guided decoding algorithm that aligns with personalized preferences by using the value function for weighting during the inference phase.

Definition 1. The personalized reward function $R$ can be represented as

$$R(p, s, a) = w_p^\top \phi(s, a), \qquad (4)$$

where $\phi(s, a) \in \mathbb{R}^d$ represents the features of the current state-action pair $(s, a)$, and $w_p \in \mathbb{R}^d$ are weights derived from the personalized preference $p$.

Intuitively, $\phi(s, a)$ can be understood as the salient features (e.g., expert, informative) of the current state, and the vector $w_p$ models the personalized preference $p$ as the degree of desirability of each feature. Since $w_p$ is independent of $(s, a)$, when the user's personalized preference changes, only $w_p$ is altered, which in turn modifies the reward function $R(p, s, a)$. We derive the value function (Watkins & Dayan, 1992) of the personalized reward function, noting that the optimal policy $\pi^*$ obtained from Equation 3 can be formulated as (Ziebart, 2010; Rafailov et al., 2024a):

$$\pi^*(a_t \mid s_t, p) = e^{(Q^*(p, s_t, a_t) - V^*(p, s_t))/\beta}, \qquad (5)$$

where $Q$, the action-value function (i.e., the Q-function) based on the token-level reward $R^\pi(p, s, a)$, models the total future reward from $(s, a)$ under policy $\pi$.
$V$ is the corresponding state-value function at the current state, where $V^\pi(p, s_t) = \beta \log \int_{\mathcal{A}} e^{Q^\pi(p, s_t, a_t)/\beta} \, da$. $Q^*$ and $V^*$ denote the optimal value functions under the optimal policy $\pi^*$. The Q-function can be expressed as:

$$Q^\pi(p, s_t, a_t) = \mathbb{E}\Big[\sum_{i=t}^{T} R(p, s_i, a_i) \,\Big|\, a_i \sim \pi(\cdot \mid s_i)\Big] \qquad (6)$$

$$= w_p^\top \, \mathbb{E}\Big[\sum_{i=t}^{T} \phi(s_i, a_i) \,\Big|\, a_i \sim \pi(\cdot \mid s_i)\Big] = w_p^\top \psi^\pi(s_t, a_t), \qquad (7)$$

where the expectation $\mathbb{E}$ is over the randomness due to sampling from the base language model $\pi$. Eq. 7 is derived by substituting Eq. 4 into Eq. 6. $\psi^\pi$ gives the expected sum of $\phi(s, a)$ when following policy $\pi$ starting from $(s, a)$, known as the successor features (SFs), which also satisfy a Bellman equation in $\phi(s, a)$ (Bellman, 1966; Barreto et al., 2017). Notice that the vector $w_p$ representing personalized preferences is thereby decoupled from the MDP dynamics.

To obtain the Q-function, we begin by rewriting Eq. 5 as in Rafailov et al. (2024b):

$$R(p, x, y) = \sum_{t=1}^{T} R(p, s_t, a_t) = \sum_{t=1}^{T} \beta \log \frac{\pi_\theta(a_t \mid s_t, p)}{\pi_{\mathrm{ref}}(a_t \mid s_t, p)} + V^*(s_1). \qquad (8)$$

Denote $\log \hat{\pi}(a \mid s)$ as the features of $(s, a)$, which satisfies $\log \pi(a \mid s, p) = w_p^\top \log \hat{\pi}(a \mid s)$. By substituting this relationship and Eq. 8 into Eq. 2, we can derive the loss function for personalized reward modeling:

$$\mathcal{L}_{\mathrm{PersRM}}(\pi_\theta, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\Big(\sum_{t=1}^{T} \beta \log \frac{\hat{\pi}_\theta(a_t^w \mid s_t^w)}{\hat{\pi}_{\mathrm{ref}}(a_t^w \mid s_t^w)} - \sum_{t=1}^{T} \beta \log \frac{\hat{\pi}_\theta(a_t^l \mid s_t^l)}{\hat{\pi}_{\mathrm{ref}}(a_t^l \mid s_t^l)}\Big)\right]. \qquad (9)$$

Inspired by Rafailov et al. (2024b), we can derive the implicit Q-function $Q^*(p, s_t, a_t)$ from the optimized personalized reward model $\pi_\theta^*$:

$$Q^*(p, s_t, a_t) = w_p^\top \psi^*(s_t, a_t) = w_p^\top \beta \sum_{i=1}^{t} \log \frac{\hat{\pi}_\theta^*(a_i \mid s_i)}{\hat{\pi}_{\mathrm{ref}}(a_i \mid s_i)} + V^*(p, s_1). \qquad (10)$$

Then, we formulate personalized alignment as decoding-time Q-guided search according to Eq. 5.

Figure 1 example. Preference p: "Generate a response that is expert and informative." Instruction x: "List the layers of the TCP/IP model." Current output y_t: "It consists of 4 layers: 1.
Application Layer: This". The candidate next tokens, scored by the base language model and the personalized reward model and combined into weighted scores, are "layer" and "topmost". Original LM output: "This layer handles high-level protocols and user interfaces, such as HTTP (HyperText Transfer Protocol)..." PAD output: "This topmost layer enables user-level applications to interact with the network. A common protocol is HTTP (HyperText Transfer Protocol)..."

Figure 1: An illustration of the inference phase of personalized alignment at decoding-time (PAD) with the optimized personalized reward model (PersRM). Given the personalized preference and the current context, we first compute the base model's probability distribution over the next token. We then compute the reward from PersRM by combining the features of the current state with the personalized weight. Finally, the next token is selected according to the weighted scores.

Definition 2 (Personalized alignment at decoding-time, PAD). The optimal policy $\pi^*_{\mathrm{PAD}}$ of personalized alignment can be defined as selecting, at each step, the action for the base model $\pi_{\mathrm{LM}}$ that maximizes the advantage function $Q^*(p, s_t, a) - V^*(p, s_t)$ for a personalized preference $p$:

$$\pi^*_{\mathrm{PAD}}(a \mid s_t, p) \propto \pi_{\mathrm{LM}}(a \mid s_t)\, e^{\beta \left(Q^*(p, s_t, a) - V^*(p, s_t)\right)}, \qquad (11)$$

where $Q^*(p, s_t, a) - V^*(p, s_t)$ is equivalent to $w_p^\top \beta \log\big(\hat{\pi}_\theta^*(a_t \mid s_t)/\hat{\pi}_{\mathrm{ref}}(a_t \mid s_t)\big)$ according to Eq. 10. Detailed derivations are provided in Appendix D.1. It can be observed that the learned reward function can serve as an optimal advantage function. Note that, unlike the RLHF framework that directly models the language model as the policy, our PAD policy $\pi_{\mathrm{PAD}}$ differs from both the base language model $\pi_{\mathrm{LM}}$ and the personalized reward model $\pi_\theta$.

3.3 GENERALIZED PERSONALIZED ALIGNMENT

In this section, we discuss the ability of PAD to transfer to unseen personalized preferences. Suppose now that we have computed the optimal value functions for $n$ personalized preferences $w_1, w_2, \ldots, w_n \in \mathcal{W}_\phi$, denoted $\{Q_1^*, Q_2^*, \ldots, Q_n^*\}$.
Now, if the reward changes to $R(p_{n+1}, s, a) = w_{n+1}^\top \phi(s, a)$, then as long as we have $w_{n+1}$ we can compute the new value function under $\pi_i^*$ by simply setting $Q_{n+1}(s, a) = w_{n+1}^\top \psi^*(s, a)$. Once the functions $Q_{n+1}$ have been computed, we can apply generalized policy improvement (Bellman, 1966; Barreto et al., 2017) to estimate their performance on $w_{n+1}$.

Theorem 1. Let $w_i \in \mathcal{W}_\phi$ and let $Q_i^*$ be the action-value function of an optimal policy for $w_i$. For all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $j \in \{1, 2, \ldots, n\}$, let $\pi(s) \in \arg\max_a \max_i Q_i^*(s, a)$. Finally, let $\phi_{\max} = \max_{s,a} \|\phi(s, a)\|$, where $\|\cdot\|$ is the norm induced by the inner product adopted. Then,

$$Q^*_{n+1}(p_{n+1}, s, a) - Q^{\pi}_{n+1}(p_{n+1}, s, a) \le |H| \, \phi_{\max} \, \min_j \|w_{n+1} - w_j\|. \qquad (12)$$

The right-hand side is a multiple of the distance between $w_{n+1}$ and the closest $w_j$ for which we have already computed a policy. The bound formalizes the intuition that if PAD has previously aligned to a similar personalized preference $w_j$, it should align well to the new preference $w_{n+1}$ (Barreto et al., 2017). The proof of Theorem 1 is given in Appendix D.2.

3.4 PRACTICAL IMPLEMENTATIONS

In this section, we introduce the practical implementation of our Personalized Alignment at Decoding-time (PAD), which includes the optimization of the personalized reward model (PersRM) and inference-time guided decoding with token-level personalized rewards.

Optimization. As previously mentioned, we can decouple personalized preferences from the MDP dynamics during the personalized alignment procedure. We therefore use an LLM-based model as the personalized reward model (PersRM) $\pi_\theta$ to independently predict the features $\phi(s, a)$ and the preference weights $w_p$. For simplicity, we employ a single backbone to predict the embeddings for both $\phi(s, a)$ and $p$; $\pi_\theta$ is initialized from the backbone, also referred to as the reference model $\pi_{\mathrm{ref}}$, and employs an additional value head to predict $w_p$. Optimization proceeds in two stages using Equation 9.
In the first stage, we fix $w_p$ as a unit vector and optimize the backbone to learn general features. In the second stage, we freeze the backbone and optimize only the value head for $w_p$ to learn different user preferences. The optimized PersRM is denoted $\pi_\theta^*$.

Guided decoding. The inference phase of PAD is shown in Figure 1. Given the personalized preference p and the current context st = (x, y