# language_models_as_implicit_tree_search__9f0c07cb.pdf

Language Models as Implicit Tree Search

Ziliang Chen 1 Zhao-Rong Lai 2 Yufeng Yang 3 Liangda Fang 2 Zhanfu Yang 4 Liang Lin 1 3

Despite advancing language model (LM) alignment, direct preference optimization (DPO) falls short in LM reasoning with the free lunch from reinforcement learning (RL). As the breakthrough, this work proposes a new RL-free preference optimization method aiming to achieve DPO along with learning another LM, whose response generation policy holds the asymptotic equivalence with Alpha Zero-like search, the apex of algorithms for complex reasoning missions like chess Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by the extra LM remains seeking to equip DPO with reasoning procedure technically akin to Alpha Zero. Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment, and MCTS-based LMs in mathematical reasoning and planning tasks.

1. Introduction

Preference optimization paradigms, notably Reinforcement Learning from Human Feedback (RLHF) (Bai et al., 2022) and Direct Preference Optimization (DPO) (Rafailov et al., 2024b), are foundational to modern Language Model (LM) alignment, aiming to imbue these models with human s behavioral characteristics, inclination, and value. In particular, DPO circumvent reward modeling so as to directly optimize LMs with preferential response pairs, thereby reinforcement learning (RL) s complexities and instability avoided to yield more efficiently and stably aligned models.

Distinct from preference alignment, sophisticated reasoning capabilities underlying human brains are not typically

1Research Institute of Multiple Agents and Embodied Intelligence, Peng Cheng Laboratory; 2Jinan University, Guangzhou, China; 3School of Computer Science and Engineering, Sun Yatsen University, Guangzhou, China; 4Department of Computer Science, Rutgers University. Correspondence to: Liang Lin <linlng@mail.sysu.edu.cn>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

replicated by RLHF and DPO alone: the emulation often necessitates synergistic integration with other techniques, i.e., meticulous prompt engineering (Wei et al., 2022), processbased reward modeling (Lightman et al., 2023), and Monte Carlo Tree Search (MCTS) (Xie et al., 2024a). Among these complementary approaches, MCTS stands out as essentially promising, due to its exploration of complex decision spaces in achieving high-level reasoning tasks such as game Go (Silver et al., 2017b). Massive leading research were proposed to combine MCTS with preference optimization, whereas their MCTS procedures inevitably require the value function learned by RL for executing the search principles of Upper Confidence Bound (UCB). It is noteworthy that the critical advantage of DPO is to skip the value learning to achieve faster and more stable preference optimization than RLHF. The standard MCTS s reliance on an RL-learned value function directly conflicts with DPO s core strength of bypassing RL entirely, showing a significant hurdle for their seamless integration. It implies the necessity to reconcile MCTS methodologies with the RL-free nature behind DPO.

Rather than the regular MCTS, this paper started from its theoretical variant derived from (Grill et al., 2020b), where Alpha Zero-like MCTS can be treated as a stochastic policy solved by the state-specific local optimization regularized by the reverse KL-divergence. With this regard, each statespecific local tree search decision asymtotically converges to Alpha Zero-like MCTS decision with the gap bounded by the empirical visit counts, thus, lifting the expressiveness of MCTS limited by the integer-count probability and sparse tree width. Regretfully, the local solution of this stochastic policy involves the parameter only solvable by dichotomic search per state. In terms of the exponentially complex state space in token-level language generation, this implicit tree search (ITS) is hardly applied in MCTS-based LM research.

Contributions. We propose a new RL-free preference optimization paradigm aiming to approximate ITS by LMs. Specifically,

Beyond the original token-selection policy in DPO, we propose another LM to learn the global policy of ITS with regards to neural universal approximation. The preference optimization built on top of DPO, without dichotomic search and any other RL elements, inherits the critical advantages of the DPO family.

Language Models as Implicit Tree Search

In terms of the gradient analysis to the preference optimization with ITS, learning its global policy yields the response generation with diverse and better aligned preference, therefore, simultaneously benefits AI alignment and reasoning tasks.

Self-improved preference policy augmentation and decoding strategy are proposed for our ITS-based preference optimization approaches, in order to align with the MCTS-based LLM decoding. The connection between the decoding strategies and group-relative policy optimization (Shao et al., 2024) has also been presented.

Our experiments included the evaluation across human preference alignment, mathematical reasoning, and mathematical planning, where our approach concurrently reaped the optima against DPO variants and MCTS-derived baselines.

2. Preliminaries

For language generation, language models (LM) θ serve as a token-level Markov decision process (MDP) (S,A,P,R,T ), where the state st = (x, y<t) in S consists of a prompt x followed by a sequence of response tokens x<t generated at the previous t-1 steps, and the action at = yt in A denotes the token selected at the current t step. Given this, the transition kernel P: (S,A) S holds the deterministic mapping from (st, at) to the next state st+1 = (x, y<t+1) with y<t+1 = (y<t, yt). The reward R: (S,A) R quantifies the preference of selecting tokens, and T T denotes the step to cease the generation when a T is the end-ofsequence (Eo S) token. The goal is to learn the sequential token-selection policy πθ that maximizes the accumulated reward R(x, y)=PT t=1 r(st, at), r R.

RLHF and DPO. Token-level MDP needs R(x, y) to reflects human preferences. RLHF (Bai et al., 2022) captures this from prompt x with its response pairs y(w) y(l), where identifies y(w) more preferable than y(l). The comparison is made by Bradley-Terry (BT) model (Bradley & Terry, 1952) in the contextual bandit setting

PBT(y(w) y(l)|x) = exp(R(x, y(w))) exp(R(x, y(w))) + exp(R(x, y(l))).

(1) where R(x, y) can be learned through

min R E(x,y(w),y(l)) D[log σ(R(x, y(w)) R(x, y(l)))]

(2) with the human preference dataset D. It serves RLHF as a KL-constraint RL objective

max πθ Ex D,y πθ( |x)[R(x, y) βDKL(πθ( |x) πref( |x))],

(3) where the reference policy πref is initialized by supervised fine-tuning (SFT) LM to learn the generation policy πθ via PPO (Schulman et al., 2017), and β > 0 controls their

trade off. DPO (Rafailov et al., 2024b) reconsiders the optimization from a interpretation of R(x, y) represented as R(x, y) = β log πθ(y|x)

πref(y|x) + β log Z(x). With such regards, Eq.2 is equivalent with

LDPO(πθ; πref) = E(x,y(w),y(l)) D log σ β log πθ(yw|x)

πref(yw|x) β log πθ(yl|x)

(4) therefore the reward modeling phase is skipped to directly optimize the token-selection policy πθ.

Token-level Interpretation of LM Alignments. The aforementioned LM alignments focus on a complete response y, yet πθ is executed in each MDP step using token-level rewards r(st, at) with β log πθ(yt|[x,y<t])

πref(yt|[x,,y<t])=β log πθ(at|st)

πref(at|st). (Rafailov et al., 2024a) re-formulates Eq.3 via the tokenlevel interpretation

max πθ Es0=x D,at πθ( |st)

h r(st, at) β log πθ(at|st)

πref(at|st)

max πθ Es0=x D,at πθ( |st)

h r(st, at)+β log πref(at|st)

+ βH(π) i . (5) So given r ME(st, at)=r(st, at)+β log πref(at|st), RLHF with the reframed reward is maximum-entropy RL (Ziebart, 2010) whose fixed point solution leads to the optimal policy π θ that satisfies π θ(at|st)=exp Q (st,at) V (st)

β with respect to V (st) = β log P

a A exp Q (st, at)/β . It is noteworthy that the optimal state-action value function holds the initial value Q (s0, a0) = R(x, y) = V (s0) + β PTy t=1 log π θ(at|st) πref(at|st) with V (s0) = V (x), henceforth we have the objective similarly derived from DPO

LT-DPO(πθ; πref) = E(x,y(w),y(l)) D log σ β

t=1 log πθ(aw t |sw t ) πref(aw t |sw t ) β

t=1 log πθ(al t|sl t) πref(al t|sl t)

(6) where aw t , al t indicate the t-th token of the responses yw, yl; and sw t = [x, yw <t], sl t = [x, yl <t], respectively.

MCTS and Alpha Zero-like Tree Search. Given the stateaction value treated as feedback from a contextual bandit, MCTS (Browne et al., 2012) builds an online search tree incrementally for decision making. (More details refers to (Silver et al., 2017b)).MCTS algorithms are famous in massive researches, among which the most successful case is Alpha Zero (Silver et al., 2017b) and its variants (Silver et al., 2017b; Schrittwieser et al., 2020). The MCTS variant underlying these methods remarked as Alpha Zero-like tree search, compute a policy at the root of the search tree improved from the distribution predicted by the prior policy

Language Models as Implicit Tree Search

π, which is updated by distilling back the tree-search policy to update the prior. The significant difference from other MCTS is Alpha Zero-like tree incorporating π as the prior during the search procedure. It leads to the heuristic search principle inspired from UCB:

a arg max a AMCTS

Qπ(s, a)+cπ(a|s)

a AMCTS n(s, a )

1 + n(s, a)

(7) where c > 0 is a balance factor between exploration and exploitation in the MCTS search space AMCTS. The fourstage procedure of Eq.7 is shown in Appendix.A.

3. Implicit Tree Search: Warm-Up

Alpha Zero-like MCTS variants have been intensively explored in leading researches for LM training (Feng et al., 2023; Zhang et al., 2024a) and alignment (Xie et al., 2024b; Chen et al., 2024a), in order to mitigate their downsides of reasoning and long-form generation ability captured in posttraining. In spite of different algorithm designs, the methodologies are consistent with MCTS in the data-generation manner: ˆπ merges with the decoding strategy to generate responses for training data augmentation, then the tokenselection policy πθ and visit counts updated along with these data, further renew the decoding by Eq.7. Despite significantly advancing LM alignments, the decision-making pipeline relies on the online state-action visit count n(s, a) per state-action pair. In terms of the integer-count probability and the sparse tree width in the initial training iters, it inevitably suffers from two critical problems.

Cold-start exploration expressiveness. P

a A n(s,a ) 1+n(s,a) is the key to differentiate ˆπ and π, so Alpha Zero-like tree search typically employs its empirical visit distribution 1+n(s,a) |AMCTS|+P

a AMCTS n(s,a ) or its exponential generalization

1+ n(s,a ) r

a AMCTS n(s,a ) r to explore the tree structure.

Whereas the discrete nature of visit counts limit the expressiveness of exploration strategy.

Proposition 3.1. Given a fixed r R/{0}, the action selection probability by the exponential empirical visit distribution holds its value with the equal cardinality of Z.

Proposition 3.2. Given a learnable r R, the action selection probability by the exponential empirical visit distribution holds the value in the range ( 1 |AMCTS|, 1).

The problem is cold-start because early decisions heavily influences the tree growth yet the ratios of integers are more unstable when the visit counts are small. It may drive the language generation into bias or sub-optima.

Sparse-action improvement. ˆπ solely improved for actions

searched at least once by the previous simulation whereas those with n(s, a)=0 would be dominated by θ. In terms of the large action space in language and the deterministic policy in Eq.7, it may cause a large simulation budget to improve actions never visited.

Recognizing that these limitations are intrinsic to the use of online visit counts in conventional MCTS algorithms, we supersedes the online tree-search strategy with stochastic policy optimization established in (Grill et al., 2020b) (depicted in Figure.1.b), named Implicit Tree Search since the optimized policy is provably equivalent with the Alpha Zerolike tree search exploration defined in Eq.7.

3.1. MCTS as Regularized Policy Optimization

More specifically, for any state, the action selection formula in Eq.7 holds the identical interpretation

a (s) arg max a AMCTS

Qπ(s, a) + λN(s) π(a|s)

where ˆπ(a|s) 1+n(s,a) |AMCTS|+P

a AMCTS n(s,a ) represents

the empirical visit distribution and λN(s) c q P

a AMCTS n(s,a )

a AMCTS n(s,a ) denotes the state-specific multi-

plier. Notice that for any s S, the empirical visit distribution ˆπ( |s) is the only way that search algorithm influences the optimal action of tree search. In language generation context, we set π=πθ, so for any s S, ˆπ( |s) holds a corresponding predominant policy π( |s) as

Theorem 3.3. (Asymptotic equivalence between ITS and MCTS policies) s S, let π( |s) be the solution of

π( |s) arg max y(s) S

h Qπθ(s) y(s)

λN(s)DKL[πθ( |s), y(s) i (9)

where S denotes the |AMCTS|-dimensional simplex, and Qπθ(s) = Qπθ(s, a1), Qπθ(s, a2), , Qπθ(s, a|AMCTS|) is the Q-value vector with respect to the state s. Then as the visit counts increase, the empirical visit distribution ˆπ( |s) in Eq.8 converges to π( |s) with the upper bound

ˆπ( |s) π( |s) |AMCTS| + 1

|AMCTS| + N (10)

where N indicates the total rounds of simulation.

Theorem.3.3 is concluded from the definition.1 and proposition.1,5 in (Grill et al., 2020b). The theoretical result verifies that, given sufficient simulation steps, there exists an asymptotic equivalence between the exploration of Alpha Zero-like tree search ˆπ( |s) and the stochastic-sampling policy π( |s). Note that the policy optimization in a simplex allows the

Language Models as Implicit Tree Search

Figure 1. Comparison between the diagrams of (a).MCTS (regular MCTS and Alpha Zero-like tree search), (b). Implicit Tree Search (ITS) (Grill et al., 2020b), and our IT-PO algorithm.

continuous value change in π( |s) that prevents the risk of cold-start expressiveness. Besides, the reversed KL divergence is smooth on π( |s), ensuring the sparse-action improvement resolved.

Accordingly, the expand and backup stages along with the MCTS procedure require the state-specific stochastic policy optimization formulated as follows

Lemma 3.4. (Solution of ITS policy (Grill et al., 2020b)) s S, the solution π( |s) of Eq.9 holds

at AMCTS, π(at|st) = λN(st) πθ(at|st) α(st) Qπθ(st, at) (11) where α(st) is defined as

1).α(st) max

α R, s.t. X

a AMCTS π(a|st) = 1

2).α(st) α(st)min max a AMCTS

Q(st, a) + λNπθ(a|st)

3).α(st) α(st)max max a AMCTS Q(st, a) + λN (12)

Lemma.3.4 is derived from Appendix.B.3 in the paper.

3.2. Challenges of Combing Implicit Tree and LMs

Observed that P

a AMCTSπ(a|st) monotonically decreases on (α(st)min,α(st)max), the theoretical result guarantees the state-specific hyper-parameter α(st) uniquely identified using dichotomic search over (α(st)min, α(st)max). In other words, π can not be universally optimized by gradient descent and even for each state, π( |st) s solution in Eq.11 is not exactly closed-form. To this end, π can not be flexibly used as explicit search like regular MCTS algorithms and instead, only applicable for policy distillation to update the Q function in Alpha Zero-like MCTS. However, it is hardly applied for LLM-based preference optimization.

The vital problem roots in the exponentially increase size of the valid state space. As demonstrated in token-level MDP, texts are generated by sequential token selection with the state st = (x, y<t) deterministically transmitted as P(st, at) st+1 = (x, y<t+1). It implies that for each prompt x, the valid state space holds the size |V |t (|V | indicates the size of token vocabulary V ) exponentially increase with t. Given such challenge, Lemma.3.4 solely promises the local policy so that the actor-critic modeling is required to obtain πθ and Qπθ simultaneously, as leading LLM-based MCTS approaches do. What s worse, the local policy also relies on α(st) inferred by dichotomic search per state, implying the computation and buffer also exponentially increasing. This problem even more severe when the action space AMCTS become sentence-level.

4. Preference Optimization with Implicit Tree Search

In the previous section, we demonstrated the advantages of π over the empirical visit exploration employed by regular MCTS methods, while its solution of Lemma.3.4 is hardly implemented in the LM realm due to the state-specific nondifferentiable hyper-parameters α(st). In this section, we derive the new RL-free preference optimization paradigm to approximate π without either Q function or α(st), yielding the response generation as MCTS without value modeling.

4.1. Language Models as ITS Policy

Specifically, we chase for the universal approximator of π based on the extra policy network πφ parameterized by the other language model φ (Figure.1.c). It supports the instant inference of π( |st) (i.e., no dichotomic search required to obtain α(st)), no matter st was visited or not.

Thought consistent with πθ in the reversed KL constraint, π is essentially a stochastic strategy generalized from deterministic policy in the Alpha Zero-like tree search (Theo-

Language Models as Implicit Tree Search

rem.3.3). It motivates us to initialize its universal approximator πφ by supervised fine-tuning. Specifically, we fine-tune a pre-trained LM to obtain πsft, whose parameters initialize the reference policy πref, the token selection policy πθ, and our ITS policy approximator πφ. Then we applied the DPO variant algorithm, i.e., conservative DPO (c DPO) (Mitchell, 2023b) to update πθ:

Lc DPO(θ) = E(x,y(w),y(l)) D

[(1 ϵ) log[σ(uθ(x, y(w), y(l)))]+ϵ log[σ(uθ(x, y(l), y(w)))] (13) where u( ) indicates the preference logit derived from BT model, ϵ [0, 0.2) denotes the margin then when ϵ = 0, c DPO degenerates into DPO.

After training with minφ Lc DPO(φ; θ), the parameters of πθ, πref are frozen to update the parameter of πφ. This manner provides the stable πθ to derive the optimization of φ, i.e., Implicit-Tree Preference Optimization (IT-PO).

4.2. Implicit Tree Preference Optimization (IT-PO)

IT-PO seeks for learning πφ to substitute π without either reward or value functions. It is noteworthy that π( |st) inherently contains Qπθ( , st) from Lemma.3.4, which serves as the implicit policy evaluation for π with the lower values implying increased exploration and vice versa. Therefore if πφ approximate π well enough, ITS can implicitly derive the value function from πφ without explicit modeling Qπθ.

For simplicity, we first consider step-synchronous ITS with the policy πθ, wherein the MCTS action node is token-level, i.e., AMCTS = A. Specifically, given fixed θ and rounds of simulation (i.e., λN(st) is also fixed), the ITS policy π( |st), st S can be uniquely determined according to Lemma.3.4. Suppose that φ achieve the optimal parameter such that πφ(at|st) = π(at|st), it holds the observation

Lemma 4.1. Suppose (x, y(w)) denotes a pair of prompt and its preferred response. For each state s(w) t = (x, y(w) <t ), the action selects either the preferred token a(w) t = y(w) t or a random token ˆat sampled from πφ ( |s(w) t ); and each state s(l) t = (x, y(l) <t), the action selects either the dispreferred token a(l) t = y(l) t or a random token ˇat sampled from πφ ( |s(l) t ). It holds

R(x, y(w)) =

r(s(w) t , ˆat) λN(s(w) t ) πθ(a(w) t |s(w) t )

πφ(a(w) t |s(w) t ) πθ(ˆat|s(w) t )

πφ(ˆat|s(w) t )

R(x, y(l)) =

r(s(l) t , ˇat) λN(s(l) t ) πθ(a(l) t |s(l) t )

πφ(a(l) t |s(l) t ) πθ(ˇat|s(l) t )

πφ(ˇat|s(l) t )

The lemma demonstrates that we may construct the accumulated reward of preference prompt-response pair by the ITS exploration starts from each state of the preferred and dispreferred sequences (i.e., root or intermediate node). As demonstrated in (Rafailov et al., 2024a), DPO-like methods hold the dense reward re-parameterized as r(st, at) = β log πθ(at|st) β log πref(at|st). Combine this with Lemma.4.1 leading to Theorem 4.2. (Step-Synchronous IT-PO) A prompt x drawn from D has a response pairs y(w) y(l). Given ˆ A={ˆat πφ( |s(w) t )}Tw t=1 and ˇ A={ˇat πφ( |s(l) t )}Tl t=1, it holds

Uss(x, y(w), y(l)) = R(x, y(w)) R(x, y(l)) =µw(φ, θ) µl(φ, θ) + δ(θ),

s.t. µw(φ, θ)=

t=1 λN(s(w) t ) πθ(a(w) t |s(w) t )

πφ(a(w) t |s(w) t ) πθ(ˆat|s(w) t )

πφ(ˆat|s(w) t )

t=1 λN(s(l) t ) πθ(a(l) t |s(l) t )

πφ(a(l) t |s(l) t ) πθ(ˇat|s(l) t )

πφ(ˇat|s(l) t )

t=1 log πθ(a(w) t |s(w) t )

πref(a(w) t |s(w) t )

t=1 log πθ(a(l) t |s(l) t )

πref(a(l) t |s(l) t )

(15) then the step-synchronous IT-PO is proposed by

Lss-IT-PO(φ; θ) = E(x,y(w),y(l)) D, ˆ A, ˇ A πφ

[(1 ϵ) log σ(Uss(x, y(w), y(l)))+ϵ log σ(Uss(x, y(l), y(w)))] (16)

Theorem 4.2 establishes that optimal approximation between π and πφ must preserve their stochastic search strategies conformity to the BT preference model. The minimization of Eq.13 is also derived from c DPO that incorporates the margin to tolerate the preference noises possibly introduced by ˆ A, ˇ A drawn from πφ. It benefits πφ to generate the high-quality responses that exceed the old data.

Step-Asynchronous ITS. MCTS algorithms significantly improve the LM s reasoning ability where the action space of each node focuses on sentence. It holds the asynchronous step between π and πθ, i.e., AMCTS = SN n=1 An, where An is the sentence-level action space with n-length tokens and N indicates the maximum length of sentences. Observe that the token-level MDP could be also identified into the sentence level, where π, πθ, πref take their sentence-level actions by generating token sequences ended by [\n] token (Definition B.3 in Appendix). We re-frame their sentencelevel policy as πS φ, πS θ , πS ref so that given the state st SS and the sentence-level action At = (a(i) t )Nt i=1 AS (Nt indicates the token number in the t-th sentence action and a(i) t indicates the i-th token in the t-th sentence), and the sentence-level reward r S(st, At) = PNt i=1 r(st, a(i) t ).

Given this, we derive the sentence-level ITS optimized by step-asynchronous variant of Eq.4.2:

Language Models as Implicit Tree Search

Theorem 4.3. (Step-Asynchronous-IT-PO) A prompt x drawn from D has a response pairs y(w) y(l) composed of T S w, T S l sentences, respectively. A(w) t , A(l) t denotes the t-th sentence in y(w), y(l), respectively. Suppose the state s(w) t ,s(l) t transmits along sentence-level MDP (Definition.B.3), a(w,i) t /a(l,i) t indicates the i-th token in the A(w) t /A(l) t , s(w,i) t /s(l,i) t denotes the sequential context ahead of a(w,i) t /a(l,i) t in y(w)/y(l). Suppose that the sentencelevel policies πS ref, πS θ , πS φ are identified by LM-based token-selection policies πref, πθ, πφ, respectively; ˆAt =

(ˆa(1) t , , ˆa( ˆ Nt) t ) πS φ( |s(w) t ),ˆs(0) t = s(w) t ,ˆs(i+1) t =

P(ˆs(i) t , ˆa(i) t ); ˇAt = (ˇa(1) t , , ˇa( ˇ Nt) t ) πS φ( |s(l) t ),ˇs(0) t =

s(l) t ,ˇs(i+1) t = P(ˇs(i) t , ˇa(i) t ). It holds

Usa(x, y(w), y(l)) = R(x, y(w)) R(x, y(l))

=µS w(φ, θ) µS l (φ, θ) + δS(θ),

s.t. µS w(φ, θ)=

t=1 λN (s(w) t ) |A(w) t | Y

πθ(a(w,i) t |s(w,i) t )

πφ(a(w,i) t |s(w,i) t )

πθ(ˆa(i) t |ˆs(i) t )

πφ(ˆa(i) t |ˆs(i) t )

µS l (φ, θ)=

t=1 λN (s(l) t ) |A(l) t | Y

πθ(a(l,i) t |s(l,i) t )

πφ(a(l,i) t |s(l,i) t )

πθ(ˇa(i) t |ˇs(i) t )

πφ(ˇa(i) t |ˇs(i) t )

δS(θ) = β T S w X

i=1 log πθ(ˆa(i) t |ˆs(i) t )

πref (ˆa(i) t |ˆs(i) t )

i=1 log πθ(ˇa(i) t |ˇs(i) t )

πref (ˇa(i) t |ˇs(i) t )

(17) then the step-synchronous IT-PO is proposed by

Lsa-IT-PO(φ; θ) = E (x,y(w),y(l)) D,{ ˆ At} T S l t=1,{ ˇ At} T S w t=1 πφ

[(1 ϵ) log σ(Uss(x, y(w), y(l)))+ϵ log σ(Uss(x, y(l), y(w)))] (18)

4.3. Self-Improved Training and Decoding In Sec.4.1, 4.2, LM φ serves as a universal approximator of the ITS policy π across the state space. Since π behaves as the stochastic variant of Alpha Zero-like tree search, we turn to discuss their common purpose, i.e., distilling πφ back into the policy πθ to enhance both their performances.

Preference Policy Augmentation. Regular MCTS algorithms used to sample decision-making trajectory to refine the original policy πθ, which is recently treated as a type of generalized policy improvement. Instead, (Grill et al., 2020b) used to distill their ITS policy by DKL(π, πθ). It eases the sampling process but also sacrifices the benefits derived from the high-quality response trajectories.

With this regards, we propose preference policy augmentation by resampling M responses {yk}K k=1 generated from each prompt x D by πφ. {yk}K k=1 are then constructed into pairs with their preference relations evaluated by Uss or Uas, thus, (yi yj) holds if either Uss(x, yi, yj) > 0 at the token level or Uas(x, yi, yj) > 0 at the sentence level. After ranking their value by Uss or Uas, the top-M preference pairs are selected for each prompt x in D. Then we collect them across all prompts in D to construct the ITS

improved dataset D+. They join with D to refine the tokenselection policy via learning θ with c DPO. We elaborate the IT-PO algorithm pipelines of its step-synchronous and stepasynchronous cases in our implementation in Appendix.C.

Self-Improvement in Gradient Analysis. Through analyzing the TI-PO loss function s gradients with respect to φ, we demonstrate how πφ facilitates the generated responses both diverse and better aligned with ground-truth preferences. For clarity, our analysis is derived from the stepsynchronous ITS using Uss for preference alignment and focuses on the positive preference logit in log[σ(Uss)]:

log[σ(Uss(x, y(w))), y(l)]

=( φµw φµl) R log σ( R)

| {z } higher when reward estimate is wrong by φ

t=1 λN(s(w) t ) πθ(a(w) t |s(w) t ) φ log[πφ(a(w) t |s(w) t )]

πφ(a(w) t |s(w) t )

πθ(ˆat|s(w) t ) φ log[πφ(ˆat|s(w) t )]

πφ(ˆat|s(w) t )

t=1 λN(s(l) t ) πθ(a(l) t |s(l) t ) φ log[πφ(a(l) t |s(l) t )]

πφ(a(l) t |s(l) t )

+ πθ(ˇat|s(l) t ) φ log[πφ(ˇat|s(l) t )]

πφ(ˇat|s(l) t )

(19) Compared with DPO, the second multiplier observed from the decomposed formula achieves the identical gradient effects. φ log[πφ(a(w) t |s(w) t )] and φ log[πφ(a(l) t |s(l) t )] similarly exist in the first term yet they take the likelihood

increase and decrease effects with coefficients πθ(a(w) t |s(w) t )

πφ(a(w) t |s(w) t )

and πθ(a(l) t |s(l) t )

πφ(a(l) t |s(l) t ). It implies that when πφ <πθ to the ground-

truth preference pairs, IT-PO takes the aggressive likelihood influence to align πφ and πθ while the conservative likelihood influence if πφ >πθ. More importantly, the implicit tree sampled from the ground-truth preference nodes, i.e., ˆat πφ( |s(w) t ), ˇat πφ( |s(l) t ) also influence the likelihood: φ log[πφ(ˆat|s(w) t )] and φ log[πφ(ˇat|s(l) t )] demonstrates that on account of the ground-truth preference nodes, πφ learns to search lower preference child nodes from s(w) t yet higher preference child nodes from s(l) t with the same conservative-aggressive strategy. Hence πφ generates certain responses from the preferred context yet explore more diverse responses when the context is not preferred.

ITS-guided Decoding. Beyond self-enhanced training, recent study on LLM-based MCTS (Feng et al., 2023) demonstrates promising results in improving the decoding process by MCTS. Motivated from this, we propose the stochastictree variant decoding from the spirits of their MCTS-α and MCTS-rollout strategies.

1). ITS-α. For each initial state xroot, the original MCTS-α

Language Models as Implicit Tree Search

decoding strategy applied Alpha-like tree search for policy evaluation then backup the visit count n(s, a) of the expo-

nential visit distribution n(st,a)1/γ P

a n(st,a )1/γ to guide decoding. Since πφ is the universal approximator of π, the stochastic variant of Alpha-like tree search. It is straightfoward to use the exponential version of πφ, exp(log[πφ(a|st)]/γ) P

a exp(log[πφ(a |st)]/γ)

(step-synchronous IT-PO) or exp(log[πS φ(a|st)]/γ) P

a exp(log[πS φ(a |st)]/γ) (stepasynchronous IT-PO) in our decoding strategy.

2). ITS-rollout. When the root xroot and the intermediate state nodes sufficiently differ from the states visited in training, the simulation rounds N behaves more closely as zero to fail the approximation between ˆπ and π in Theorem.3.3. In this case, πφ need to be updated to adapt the state xroot as the backup process in MCTS. Due to no value function available, we employ πθ to implicitly evaluate arbitrary pairs of responses y1,y2 generated by πφ( |x), then the sampled responses with preferences evaluated by θ would join the meta-update of πφ to the state x

φ φ (x,y1 y2) φ Lsa-IT-PO s.t. y1 y2 if uθ(x, y1, y2) > 0, (20)

which refreshes exp(log[πφ (a|st)]/γ) P

a exp(log[πφ (a |st)]/γ) to facilitate our decoding strategy. ITS-rollout can be treated as test-timetraining version of ITS-α. Its implementation first updates φ by (20), then use ITS-α with φ to facilitate the decoding process. In our experiment, 8 responses for each test prompt were generated to achieve self-training, resulting the extra 0.5 1 hour as the decoding warm-up stage.

Provided the decoding probabilities obtained by ITS-rollout, the BF-S strategy (Breath-first Search) can be applied based on the implicit advantage function solely approximated by πφ and πθ. More specifically,

Aπθ(a, s) =Qπθ(a, s) E[Qπθ(a , s)|a πθ( |s)]

Qπθ(a, s) 1

a πθ( |s) Qπθ(a , s)

= α(s) λN(s)πθ(a|s)

α(s) λN(s)πθ(a |s)

= λN(s) πθ(a|s)

λN(s) πθ(a|s)

πθ(a |s) πφ(a |s)

(21) where N indicates how many responses drawn from πθ to approximate the advantage function. Such decoding strategy is closely related with wisdom of group-relative policy optimization (GRPO), a well-known value-free RL algorithm employed to train Deepseek (Shao et al., 2024).

Figure 2. Win rate measured by GPT-4 via the consistent prompts in previous studies: (a) Win rate of baselines decoding with different temperatures; (b) Win rate of πθ and πφ across the alternative phases of their preference policy distillation.

5. Experiments

In this section, we demonstrate the superiority of IT-PO from the step-synchronous (Theorem.4.2) and step-asynchronous (Theorem.4.3) perspectives. In the step-synchronous cases, IT-PO trains πφ to perform token-level ITS in order to provide the fine-grained human preference alignment; in the step-asynchronous scenarios, we evaluate πφ trained by ITPO to perform sentence-level search in step-asynchronous scenarios, i.e., mathematical reasoning and planning tasks where LLM-based MCTS algorithms are extensively used.

5.1. Experiments on Token-level Preference Alignment

Anthropic HH dataset (Bai et al., 2022)1 consists of 170k dialogues between a human and an automated assistant, each of which presents as a history with alternative responses with respect to different preferences annotated by humans. We conduct the conventional evaluation setup using Pythia 2.8 (Biderman et al., 2023) as the base model, then each dialogue with its preferred completion in Anthropic HH training set is incorporated for supervised fine-tuning to derive three LMs: the parameter-frozen πref and parameterinitialized πθ, πφ. In terms of preference alignment on token-level MDP, SS-IT-PO with respect to the LMs θ, φ are alternatively trained, then the IT-PO variants both compared with token-level DPO-family baselines, i.e., DPO (Rafailov et al., 2024a), RTO (Zhong et al., 2024a), TDPO1 and TDPO2 (Zeng et al., 2024a).

The experiment primarily aims for three evaluation metrics: 1). Accuracy: we adopt the evaluation split in (Zeng et al.,

2024a) to train all models then evaluate their performance

1https://huggingface.co/datasets/Anthropic/hh-rlhf

Language Models as Implicit Tree Search

Table 1. Comparison in terms of the trade-off between Alignment (accuracy) and Diversity (entropy) on the Anthropic HH dataset. The indicates higher values are preferable.

Method Alignment Diversity

Acc (%) Ent

DPO 59.43 3.196

RTO 61.43 3.314

TDPO1 60.08 4.727

TDPO2 67.33 4.915

IT-PO θ (ours) 67.75 4.564

IT-PO φ (ours) 69.12 5.315

in terms of the accuracy on the generated responses relative to chosen completions in the test dataset; 2). Diversity: Nucleus sampling with p = 0.95 to generate 25 responses then the predictive entropy across the responses indicates the generation diversity. 3).Win rate: all baseline approaches are evaluated through GPT-4 against the chosen responses in the test set, so that > 50% implies the human preference alignment achieved in their performances.

Both LMs θ and φ utilize ITS-α to enhance the decoding strategy. The exponential rate γ =1 consistently across the evaluation of three metrics. This choice aligns with MCTSα, which employs token-level tree search in alignment tasks. Accuracy and Diversity across the baselines are presented in Table 5.1. TDPO2 demonstrates competitive performance, achieving alignment accuracy comparable to IT-PO θ, even exhibiting superior diversity. On the other hand, IT-PO φ, trained as the stochastic tree policy, outperforms TDPO2 by a significant margin in both alignment and diversity metrics. It is because that LM θ in our framework performs more likely as a policy evaluator trained to boost the evolution of stochastic tree policy πφ.

To verify our assumption, we further observe their Win rate illustrated in Figure.2(a). Through scaling the temperature during inference, T-DPOs are found quite sensitive to the temperature, with lower performance than most baselines when the temperature value are extreme. Instead, both ITPO θ and IT-PO φ consistently performs with robust value over 50% in 7 out of 10 cases where the other baselines underperform or even fail in the qualified Win rate level. In Figure.2 (b), we further observe the Win rate progress when IT-PO θ and IT-PO φ by through their alternative policy distillation. When no alternative strategy used (iter = 1), the win rate of IT-PO θ and IT-PO φ are solely on par with the win rates of DPO and RTO, respectively, which largely underperform T-DPO variants. While after five iterations of alternative preference policy distillation, Win rates in ITPO θ and IT-PO φ are both incredibly improved to exceed their early, even their teachers performances. These observations are consistent with our analysis to φ s gradients.

Table 2. Performance comparison of different methods on GSM8k and Game24 datasets

Setting Method Performance(%) / # Tokens

GSM8k Game24

Co T-greedy 41.4 98 12.7 76 BFS-V 52.5 485 64.8 369 MCTS-α 51.9 561 63.3 412 MCTS-Rollout 47.8 3.4k 71.3 670 ITS-α (ours) 53.2 561 67.6 380 ITS-Rollout (ours) 51.6 3.8k 73.2 646

Equal-Token

Co T-SCMAJ 46.8 500 14.6 684 Co T-SCORM 52.3 500 50.6 684 BFS-VORM - - 70.90 1.6k MCTSORM - - 69.34 649 ITS-α (ours) - - 70.64 1.6k ITS-Rollout (ours) - - 71.42 698

5.2. Experiments on Mathematical Tasks

Our second experimental suite includes the tasks of mathematical reasoning on GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) and mathematical planning on Game24 (Cobbe et al., 2021). Although the benchmarks are famous in evaluating LLM s reasoning capability by PRM (process reward model) (Lightman et al., 2023) and MCTSbased methods (Feng et al., 2023), they are nontrivial for DPO family since explicit value function can not be skipped to execute MCTS reasoning strategies.

Baselines and Evaluation. Since most DPO-based approaches can not adapt to LLM reasoning without explicit value modeling, we are more interested in the comparison between TI-PO and state-of-the-art tree-search (TS) LLM and Chain-of-Thought (Co T) (Wei et al., 2022) baselines in (Feng et al., 2023). More specifically, the evaluated baselines for GSM8K and Game24 are derived from LLAMA-7b 2 base model, specifically include Co T-greedy (greedy value search by Co T), BFV (Breath-first search with their learned value function), MCTS-α (Alpha Zero-like tree search with their learned value function), MCTS-rollout (MCTS-α variant that allows the backup process happen in the intermediate step). As a counterpart, we incorporate the policies πθ and πφ with their LMs fine-tuned by c DPO (Eq.13) and step-asynchronous IT-PO objective (Eq.17), respectively, then achieve their alternative post-training via preference policy augmentation. After five iterations, the policy models jointly facilitate the decoding processes via ITS-α and ITSrollout strategies. All baselines are evaluated in the single path setup via Path@1 and Equal-Token, the latter try to

2https://huggingface.co/meta-llama/Llama-2-7b

Language Models as Implicit Tree Search

Table 3. Path@1 metric on Game24 with different node size.

Method Performance(%) / # Tokens

width=6 width=20 width=50

MCTS-α 41.6 243 63.3 412 74.5 573

MCTS-Rollout 43.8 401 71.3 670 80.7 833

BFS-V 43.2 206 64.8 370 74.6 528

ITS-α (ours) 46.1 267 67.6 380 75.0 647

ITS-Rollout (ours) 48.9 489 73.2 646 81.8 954

compare the results with the similar scale computation consumption. For the evaluation on MATH, we construct the training set integrated with the training splits of GSM8K and MATH, then consider Qwen1.5-32B (Team, 2024) as our base model. Beyond this, we also introduced LLAMA3.18B as an alternative of φ to verify whether φ can be replaced by a smaller LLM. To this, we have the evaluated LLMs (θ,φ) defined by (Qwen1.5-32B,Qwen1.5-32B) and (Qwen1.5-32B,LLAMA3-8B), respectively. We also employed greedy Co T (3 shots), MCTS-α, and MCTS-rollout for their comparison.

Implementation. Distinct from human preference data with pairwise responses, reasoning tasks only consist of question prompts and its correct solution responses. It motivates us to reconfigure their data to adapt our preference optimization regime. Beyond this, since our tree-search strategy is indeed a stochastic policy, the search depth and breadth for training LLM φ were not limited in all datasets, while their decoding procedures i.e., ITS-α and ITS-rollout, inherited the pruning setup derived from the MCTS baselines to address the heavy computation while the tree search runs for reasoning-oriented inference. Details refer to Appendix.C.

Results.According to the GSM8k results (Table 2), when examining Path@1 performance across all baseline methods, tree-search algorithms (excluding our proposed methods) don t show significant advantages over standard Co T approaches. MCTS variants actually perform worse than BFV, despite their higher computational requirements. However, the Game24 results tell a different story. In this task, Co T approaches perform poorly, largely because Game24 s structure allows for wider and deeper search trees, which better showcases the strengths of tree-search algorithms (Table B.5). But interestingly, ITS-α and ITS-rollout consistently outperform other approaches, regardless of the default tree width and depth limits during inference.

This superior performance can be attributed to our IT-PO training regime. Unlike traditional approaches that slowly build sparse trees based on visit counts, our approximated π encourages broader exploration across the entire action space. This approach is computationally efficient since ITPO only needs to sample the next sentence (rather than the

Table 4. The experimental results in MATH. All baselines employed Qwen1.5-32B as their base models.

Baselines/Decoding greedy Co T -α -rollout

Qwen1.5-32B 36.1 - -

MCTS- (Qwen1.5-32B) - 36.0 36.7

ITS- (φ=Qwen1.5-32B) - 39.8 40.2

ITS- (φ=Qwen1.5-32B) - 37.9 38.2

complete reasoning path) from preferred/dispreferred contexts (as defined by ˆAt and ˇAt in Eq. 4.3). This enables faster search and backup operations compared to standard MCTS. As a result, even with bounded search width and depth during inference, the stochastic policy demonstrates robust performance due to effective training simulation. In order to support our claim, we further ablate the maximum tree width and node size during inference to observe the performance variation across different baselines. As shown in Table.3,4, expanding the node size and tree width significantly boosts regular TS-LLM performances. While ITS-α and ITS-Rollout have already achieved impressive results with the small node size and and tree width. Beyond this, they also enjoy the performance growth along with the increasing exploration space.

In Table.5, we presented the evaluation in MATH that contains more difficult mathematical reasoning tasks. In terms of the greedy Co T results in LLAMA3-8B (20.5), some observations can be found in the table. First, we note that the MCTS strategy derived from GSM8K is unreliable in MATH, probably due to the conflict between the problem complexity and limited search width and depth during training, while ITS-α and ITS-rollout did not suffer from this problem. Second, with a weaker model (LLAMA3-8B) to implement LLM φ, IT-PO still enabled the improvement of LLM-based reasoning. It implies the feasibility of ITPO. Specifically, despite avoiding value modeling, IT-PO introduces the extra LM φ instead of the single LM θ in standard DPO variants, leading to the increased computational and memory requirement. While the evidences in Table.5 demonstrated that using LM φ with the significantly smaller and weaker base model than LM θ, ITS-(θ =Qwen1.5-32B, φ = LLAMA3-8B) still yields the results very competitive with ITS-(θ = Qwen1.5-32B, φ = Qwen1.5-32B).

6. Concluding Remark

This paper presents a new RL-free methodology to equip DPO with MCTS interpreted as stochastic policy to better align LMs with human preferences, and simultaneously outperforms MCTS-based LLM reasoning baseline methods in both mathematical reasoning and planning tasks.

Language Models as Implicit Tree Search

Acknowledgement

The research was supported in part by Guangdong S&T Programme (Grant No. 2024B0101010003); in part by The Major Key Project of PCL (No. PCL2024A04, PCL2025A02); in part by National Natural Science Foundation of China (NSFC) under Grant No.62206110, 62176103, 62377208, and 62276114; in part by the Science and Technology Planning Project of Guangzhou under grants 2024A04J9896, 2025A03J3565.

In particular, we thank all the reviewers for their constructive suggestions that help to improve this work.

Impact Statement

This work advances preference-based learning for large language models through improved exploration mechanism, which has implications for AI alignment and safer deployment of language technologies. By enhancing LLMs reasoning and long-form generation capabilities through more principled exploration and preference learning, our approach could lead to more reliable and controllable language models. However, improved reasoning capabilities could also enable more sophisticated text generation that may be misused. The integration of MCTS with preference learning represents a step toward more transparent optimization of language model behavior, though careful consideration should be given to the selection of preference data to avoid encoding harmful biases. We believe the technical advances presented here can contribute positively to the development of more capable and aligned language models when deployed thoughtfully with appropriate safeguards and oversight.

Abdulhai, M., White, I., Snell, C., Sun, C., Hong, J., Zhai, Y., Xu, K., and Levine, S. Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models. ar Xiv preprint ar Xiv:2311.18232, 2023.

Amini, A., Vieira, T., and Cotterell, R. Direct preference optimization with an offset, 2024. URL https: //arxiv.org/abs/2402.10571.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. ar Xiv preprint ar Xiv:2204.05862, 2022.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling.

In International Conference on Machine Learning, pp. 2397 2430. PMLR, 2023.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345, 1952.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1 43, 2012.

Chen, G., Liao, M., Li, C., and Fan, K. Step-level value preference optimization for mathematical reasoning. ar Xiv preprint ar Xiv:2406.10858, 2024a.

Chen, G., Liao, M., Li, C., and Fan, K. Step-level value preference optimization for mathematical reasoning, 2024b. URL https://arxiv.org/abs/2406.10858.

Chen, Z., Wang, K., Wang, X., Peng, P., Izquierdo, E., and Lin, L. Deep co-space: Sample mining across feature transformation for semi-supervised learning. IEEE Transactions on Circuits and Systems for Video Technology, 28 (10):2667 2678, 2017.

Chen, Z., Huang, X., Guan, Q., Lin, L., and Luo, W. A retrospect to multi-prompt learning across vision and language. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 22190 22201, 2023.

Chen, Z., Zheng, Y., Lai, Z.-R., Guan, Q., and Lin, L. Diagnosing and rectifying fake ood invariance: A restructured causal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 11471 11479, 2024c.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems, 2021. URL https://arxiv. org/abs/2110.14168, 2021.

Coulom, R. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pp. 72 83. Springer, 2006.

Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. Raft: Reward ranked finetuning for generative foundation model alignment, 2023. URL https://arxiv.org/ abs/2304.06767.

Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., and Zhang, T. Rlhf workflow: From reward modeling to online rlhf. ar Xiv preprint ar Xiv:2405.07863, 2024.

Language Models as Implicit Tree Search

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep policy gradients: A case study on ppo and trpo, 2020. URL https://arxiv.org/abs/2005.12729.

Feng, X., Wan, Z., Wen, M., Mc Aleer, S. M., Wen, Y., Zhang, W., and Wang, J. Alphazero-like tree-search can guide large language model decoding and training. ar Xiv preprint ar Xiv:2309.17179, 2023.

Feng, X., Wan, Z., Wen, M., Mc Aleer, S. M., Wen, Y., Zhang, W., and Wang, J. Alphazero-like tree-search can guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179.

Grill, J.-B., Altch e, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-carlo tree search as regularized policy optimization. pp. 3769 3778, 2020a.

Grill, J.-B., Altch e, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-carlo tree search as regularized policy optimization. In International Conference on Machine Learning, pp. 3769 3778. PMLR, 2020b.

Hao, S., Gu, Y., Ma, H., et al. Reasoning with language models is planning with a world model. ar Xiv preprint ar Xiv:2305.14992, 2023.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. ar Xiv preprint ar Xiv:2103.03874, 2021.

Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., and Raileanu, R. Understanding the effects of rlhf on llm generalisation and diversity. ar Xiv preprint ar Xiv:2310.06452, 2023.

Kocsis, L. and Szepesv ari, C. Bandit based monte-carlo planning. In European conference on machine learning, pp. 282 293. Springer, 2006.

Liao, W., Chu, X., and Wang, Y. Tpo: Aligning large language models with multi-branch & multi-step preference trees. ar Xiv preprint ar Xiv:2410.12854, 2024.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let s verify step by step. ar Xiv preprint ar Xiv:2305.20050, 2023.

Liu, A., Bai, H., Lu, Z., Sun, Y., Kong, X., Wang, S., Shan, J., Jose, A. M., Liu, X., Wen, L., Yu, P. S., and Cao, M. Tis-dpo: Token-level importance sampling for direct preference optimization with estimated weights, 2024a. URL https://arxiv.org/abs/2410.04350.

Liu, Z., Lu, M., Zhang, S., Liu, B., Guo, H., Yang, Y., Blanchet, J., and Wang, Z. Provably mitigating overoptimization in rlhf: Your sft loss is implicitly an adversarial regularizer, 2024b. URL https://arxiv.org/ abs/2405.16436.

Mitchell, E. A note on dpo with noisy preferences & relationship to ipo. ar Xiv preprint ar Xiv:2304.12345, 2023a.

Mitchell, E. A note on dpo with noisy preferences & relationship to ipo, 2023b.

Ouyang, Y., Wang, L., Yang, F., Zhao, P., Huang, C., Liu, J., Pang, B., Yang, Y., Zhan, Y., Sun, H., Lin, Q., Rajmohan, S., Deng, W., Zhang, D., Sun, F., and Zhang, Q. Token-level proximal policy optimization for query generation, 2024. URL https://arxiv.org/abs/ 2411.00722.

Pan, L., Albalak, A., Wang, X., and Wang, W. Y. Logiclm: Empowering large language models with symbolic solvers for faithful logical reasoning. In The 2023 Conference on Empirical Methods in Natural Language Processing.

Rafailov, R., Hejna, J., Park, R., and Finn, C. Your language model is secretly a q-function. ar Xiv preprint ar Xiv:2404.12358, 2024a.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024b.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024c.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3): 203 230, 2011.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839): 604 609, 2020.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017.

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ar Xiv preprint ar Xiv:2402.03300, 2024.

Language Models as Implicit Tree Search

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. ar Xiv preprint ar Xiv:1712.01815, 2017a.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. nature, 550(7676):354 359, 2017b.

Tafjord, O., Mishra, B. D., and Clark, P. Proofwriter: Generating implications, proofs, and abductive statements over natural language. ar Xiv preprint ar Xiv:2012.13048, 2020.

Team, Q. Introducing qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1. 5/.

Uesato, J., Kushman, N., Kumar, R., et al. Solving math word problems with process-and outcome-based feedback. ar Xiv preprint ar Xiv:2211.14275, 2022.

Wang, X., Song, L., Tian, Y., Yu, D., Peng, B., Mi, H., Huang, F., and Yu, D. Towards self-improvement of llms via mcts: Leveraging stepwise knowledge with curriculum preference learning, 2024a. URL https: //arxiv.org/abs/2410.06508.

Wang, X., Song, L., Tian, Y., Yu, D., Peng, B., Mi, H., Huang, F., and Yu, D. Towards self-improvement of llms via mcts: Leveraging stepwise knowledge with curriculum preference learning, 2024b. URL https: //arxiv.org/abs/2410.06508.

Wang, Y., Liu, Q., and Jin, C. Is rlhf more difficult than standard rl? a theoretical perspective. Advances in Neural Information Processing Systems, 36:76006 76032, 2023.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824 24837, 2022.

Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and Shieh, M. Monte carlo tree search boosts reasoning via iterative preference learning. ar Xiv preprint ar Xiv:2405.00451, 2024a.

Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and Shieh, M. Monte carlo tree search boosts reasoning via iterative preference learning. ar Xiv preprint ar Xiv:2405.00451, 2024b.

Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and Shieh, M. Monte carlo tree search

boosts reasoning via iterative preference learning, 2024c. URL https://arxiv.org/abs/2405.00451.

Yao, S., Yu, D., Zhao, J., et al. Tree of thoughts: Deliberate problem-solving with large language models. ar Xiv preprint ar Xiv:2305.10601, 2023.

Yuan, W., Pang, R. Y., Cho, K., Li, X., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models, 2024. URL https://arxiv.org/abs/2401.10020.

Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., and Huang, F. Rrhf: Rank responses to align language models with human feedback without tears, 2023. URL https: //arxiv.org/abs/2304.05302.

Zeng, Y., Liu, G., Ma, W., Yang, N., Zhang, H., and Wang, J. Token-level direct preference optimization. ar Xiv preprint ar Xiv:2404.11999, 2024a.

Zeng, Y., Liu, G., Ma, W., Yang, N., Zhang, H., and Wang, J. Token-level direct preference optimization, 2024b. URL https://arxiv.org/abs/2404.11999.

Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., and Tang, J. Rest-mcts*: Llm self-training via process reward guided tree search. ar Xiv preprint ar Xiv:2406.03816, 2024a.

Zhang, X., Du, C., Pang, T., Liu, Q., Gao, W., and Lin, M. Chain of preference optimization: Improving chainof-thought reasoning in llms, 2024b. URL https:// arxiv.org/abs/2406.09136.

Zhong, H., Feng, G., Xiong, W., Cheng, X., Zhao, L., He, D., Bian, J., and Wang, L. Dpo meets ppo: Reinforced token optimization for rlhf. ar Xiv preprint ar Xiv:2404.18922, 2024a.

Zhong, H., Feng, G., Xiong, W., Cheng, X., Zhao, L., He, D., Bian, J., and Wang, L. Dpo meets ppo: Reinforced token optimization for rlhf, 2024b. URL https:// arxiv.org/abs/2404.18922.

Zhou, W., Zhang, S., Zhao, L., and Meng, T. T-reg: Preference optimization with token-level reward regularization, 2024. URL https://arxiv.org/abs/2412. 02685.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

Language Models as Implicit Tree Search

A. Related Work

The integration of search-based optimization techniques in language model alignment has significantly advanced AI alignment, with Reinforcement Learning with Human Feedback (RLHF)(Bai et al., 2022; Wang et al., 2023; Kirk et al., 2023; Dong et al., 2024) being a widely used approach. RLHF employs reward models trained from human feedback to optimize model behavior through reinforcement learning, typically using Proximal Policy Optimization (PPO) (Zhong et al., 2024b). However, RLHF has been criticized for instability, sample inefficiency, and over-optimization issues (Engstrom et al., 2020; Liu et al., 2024b; Chen et al., 2023; 2017; 2024c). To address these challenges, methods like Reward Ranked Fine Tuning (RAFT) (Dong et al., 2023) and Rank Responses to align Human Feedback (RRHF) (Yuan et al., 2023) have been proposed to refine ranking-based optimization without explicit reinforcement learning. More recently, Direct Preference Optimization (DPO) (Rafailov et al., 2024c; Amini et al., 2024) has emerged as an alternative, allowing language models to be aligned directly from human preference data without requiring a reward model.

Unlike PPO, which operates within a reinforcement learning framework by optimizing policies through reward feedback, DPO reformulates the alignment problem as a supervised learning task, making policy updates more stable while maintaining alignment with human preferences. Further refinements, such as processing the paragraph at token level with methods such as TDPO (Zeng et al., 2024b), T-REG (Zhou et al., 2024), TPPO (Ouyang et al., 2024), and TIS-DPO (Liu et al., 2024a), improve efficiency by incorporating token-wise adjustments, improving preference alignment in a manner that contrasts with RLHF s reliance on policy gradient updates.

Monte Carlo Tree Search (MCTS) has been extensively applied in decision-making tasks, particularly in game-playing AI, as demonstrated by its success in Alpha Go (Silver et al., 2017b), Alpha Zero (Silver et al., 2017a), and Mu Zero (Schrittwieser et al., 2020). Recent advancements have expanded its application to large language models (LLMs) for structured text generation, such as in Alpha Zero-like Tree-Search for LLMs (TS-LLM) (Feng et al., 2023). Similarly, Xie et al. (Xie et al., 2024c) propose an iterative preference learning approach, using MCTS to refine step-wise reasoning capabilities in LLMs. Further developments by Wang et al. (Wang et al., 2024b) introduce self-improvement techniques where LLMs leverage MCTS for preference-guided reinforcement learning, while Chen et al. (Chen et al., 2024b) focus on step-level value preference optimization, allowing fine-grained preference learning at intermediate steps. Zhang et al. (Zhang et al., 2024b) extend this concept with chain preference optimization, improving long-range decision-making for complex reasoning tasks. Lastly, Liao et al. (Liao et al., 2024) introduce Tree-based Preference Optimization (TPO), which integrates MCTS with preference alignment techniques to refine LLM outputs progressively. Despite these advancements, a key challenge remains: these approaches still require learning an explicit reward value function, which is critical for optimizing the search process and improving the efficiency of LLM training and inference.

Our work proposes Implicit Tree Search (ITS), which integrates stochastic policy optimization with Alpha Zero-like search principles while eliminating explicit tree structures. ITS leverages reversed KL-divergence constraints and a stochastic sampling policy to enhance exploration expressiveness, addressing cold-start issues common in MCTS-based methods. Similar ideas have been explored in Monte Carlo-based regularized policy optimization (Grill et al., 2020a), (Wang et al., 2024a) they use pairwise training framework that enables LLMs to self-improve through MCTS behavior but ITS extends these principles to preference learning in language models.

Furthermore, our approach aligns with broader research on AI safety and preference-based learning (Yuan et al., 2024; Xie et al., 2024a; Chen et al., 2024a; Mitchell, 2023a). ITS represents a novel search-and-learn paradigm that improves structured reasoning in language models without relying on explicit tree expansion, bridging the gap between MCTS and modern alignment techniques.

C.1. Fundamentals of Monte Carlo Tree-Search Methods

Monte Carlo Tree Search (MCTS) has been widely adopted as an effective strategy for solving problems requiring sequential decision-making and planning. Traditional MCTS operations, as introduced by (Kocsis & Szepesv ari, 2006) and (Coulom, 2006), include four key steps: selection, expansion, simulation, and backpropagation. However, these operations can be adapted for more advanced frameworks like Alpha Zero (Silver et al., 2017b), which incorporates a learned value function and policy network to guide the search process.

To address challenges in balancing exploration and exploitation, we utilize a variant of MCTS with a modified Predictor Upper Confidence Tree (PUCT) algorithm (Rosin, 2011). The algorithm selects actions at at each node st as follows:

at = arg max a Q(st, a) + U(st, a) ,

Language Models as Implicit Tree Search

where U(st, a) is calculated using the formula:

U(s, a) = cpuct πθ(s, a)

b N(s, b) 1 + N(s, a) .

Here, N(s, a) represents the visit count of action a at node s, and cpuct is a constant controlling exploration, as defined by:

cpuct = log P

b N(s, b) + cbase + 1

Node Expansion and Assessment: Upon reaching a leaf node s L, if it is not terminal, the tree is expanded by generating possible successor nodes. The value of the leaf node is then estimated using a neural network. For terminal nodes, a reward function R(s L) is used, or an Outcome Reward Model (ORM) serves as an approximation (Uesato et al., 2022).

Value Propagation: Once a leaf node is assessed, the computed values are propagated back through the path s0, s1, . . . , s L. For each node, the visit count is updated as:

N(st, a) = N(st, a) + 1,

and the cumulative action value is updated as:

W(st, a) = W(st, a) + v(s L).

The mean action value is then computed as:

Q(st, a) = W(st, a)

This combination of learned value functions and search-based methods enables efficient exploration of large decision spaces, as demonstrated in Alpha Zero (Silver et al., 2017b) and its applications to tree-search-guided language models (Yao et al., 2023; Hao et al., 2023).

B.1. Proof of Proposition 3.1

Proof. Given r R and a AMCTS/{a}, n(s, a ) are fixed, the probability of empirical visit distribution to the action a denotes as 1+n(s,a)r

a AMCTS n(s,a )r . It holds

1 + n(s, a)r

|AMCTS| + P

a AMCTS n(s, a )r = 1 1 (|AMCTS| + P

a AMCTS/{a} n(s, a )r)

|AMCTS| + P

a AMCTS/{a} n(s, a )r + n(s, a)r (22)

Since |AMCTS| + P

a AMCTS/{a} n(s, a )r is constant, the exponential empirical visit distribution only changes along with n(s, a )r changes. Due to |AMCTS| > 1 as tree definition, |AMCTS| + P

a AMCTS/{a} n(s, a )r > 1. Note that f(x) = 1 1 c

c+x is bijective with respect to c > 1 and when r = 0, {nr : n Z {0}} holds the consistent cardinality with Z. The proposition has been proved.

B.2. Proof of Proposition 3.2

Proof. To prove the proposition, we only need to prove the following lemma:

Lemma B.1. Given positive integers a, b, and {ci}a i=1, define

f(r) = 1 + br

a + 1 + Pa i=1 cr i + br .

We claim that for all real r, the value of f(r) lies strictly within the open interval 1 a+1, 1

Language Models as Implicit Tree Search

Proof. 1. As r : Since br 0 and cr i 0 for each i, we have

lim r f(r) = lim r 1 + br

a + 1 + Pa i=1 cr i + br = 1 + 0 a + 1 + 0 + 0 = 1 a + 1.

Thus f(r) never goes below 1/(a + 1).

2. As r + : Let M = max{b, c1, c2, . . . , ca}. For sufficiently large r, M r dominates br and each cr i , so the numerator and denominator in f(r) are asymptotically proportional to M r, giving

lim r + f(r) = 1.

Since br and cr i increase with r, one can show f(r) itself is strictly increasing in r. Consequently, its image over r R is precisely 1 a+1, 1 .

Counterexample. Because f(r) is bounded below by 1 a+1, any real x with

0 < x < 1 a + 1

cannot be realized by f(r). For instance, choose

x0 = 1 2(a + 1).

0 < 1 2(a + 1) < 1 a + 1,

but there is no real r for which f(r) = x0. Hence, values in 0, 1 a+1 are not attainable.

Set a = |AMCTS| 1, b = n(s, a), and ci = n(s, ai), ai AMCTS/{a}, the proposition has been proved.

B.3. Proof of Lemma 4.1

Proof. As discussed in Lemma.3.4,

πφ(at|st) = λN(st) πθ(at|st) α(st) Qπθ(st, at) r(st, at) = α(st) V πθ(st) λN(st) πθ(at|st)

πφ(at|st), (23)

where Qπθ(st, at) = r(st, at) + V πθ(st). Provided (x, y(w)) denoting a pair of prompt and its preferred response, we construct the preferred state-action trajectory along with the token-level MDP, i.e., {(s(w) t , a(w) t )}Tw t=1 w.r.t. s(w) t = (x, y(w) <t ) and a(w) t = y(w) t . Then based on the definition, for the t-th state in the preferred state-action trajectory, we sample another action ˆai from the parameterized ITS exploration policy πφ i.e., ˆat πφ( |s(w) t ). For the preferred state-action trajectory, Eq.23 holds at each step so that

R(x, y(w)) =

t=1 r(s(w) t , a(w) t ) =

α(s(w) t ) V πθ(s(w) t ) λN(s(w) t ) πθ(a(w) t |s(w) t )

πφ (a(w) t |s(w) t )

r(s(w) t , ˆat) = α(s(w) t ) V πθ(s(w) t ) λN(s(w) t ) πθ(ˆat|s(w) t )

πφ (ˆat|s(w) t ) . (25)

Language Models as Implicit Tree Search

t=1 r(s(w) t , ˆat) =

α(s(w) t ) V πθ(s(w) t ) λN(s(w) t ) πθ(a(w) t |s(w) t )

πφ (a(w) t |s(w) t )

α(s(w) t ) V πθ(s(w) t ) λN(s(w) t ) πθ(ˆat|s(w) t )

πφ (ˆat|s(w) t )

= λN(s(w) t ) πθ(a(w) t |s(w) t )

πφ (a(w) t |s(w) t ) πθ(ˆat|s(w) t )

πφ (ˆat|s(w) t )

Therefore R(x, y(w)) = PTw t=1

r(s(w) t , ˆat) λN(s(w) t ) πθ(a(w) t |s(w) t )

πφ(a(w) t |s(w) t ) πθ(ˆat|s(w) t )

πφ(ˆat|s(w) t )

The dispreferred state-action trajectory from (x, yl) can be also constructed in the same routine, i.e., {(s(l) t , a(l) t )}Tl t=1 w.r.t. s(l) t = (x, y(l) <t) and a(l) t = y(l) t . For the i-th state in the dispreferred state-action trajectory, we sample another action ˇai from πφ i.e., ˇai πφ( |s(l) t ). For the dispreferred state-action trajectory, it holds the mirror formulations of Eq.24-26 , then

from the same deduction, R(x, y(l)) = PTl t=1

r(s(l) t , ˇat) λN(s(l) t ) πθ(a(l) t |s(l) t )

πφ(a(l) t |s(l) t ) πθ(ˇat|s(l) t )

πφ(ˇat|s(l) t )

B.4. Proof of Theorem 4.2

Proof. Let s introduce the dense reward reparameterzation theory in (Rafailov et al., 2024a):

Lemma B.2. (Reparameterzed dense reward) (Rafailov et al., 2024a) Given a reference policy πref and a parameter β > 0 all reward classes consistent with the Plackett-Luce (and Bradley-Terry) models, the step-wise reward r(st, at) can be represented with the a re-parameterization

r(st, at) = β log π(at|st) β log πref(at|st) (27)

within the token MDP where V (st) = 0 for all terminal state.

Here we combine Lemma.4.1 and Lemma.B.2 to certify our theorem. Specifically, V πθ(s T ) = V (s T ) = 0 for all terminal state given any sequence, since θ is optimized from DPO-like algorithms. By Lemma.B.2, we have the dense reward decomposition

r(s(w) t , ˆat) = β log πθ(ˆat|s(w) t ) β log πref(ˆat|s(w) t ) = β log πθ(ˆat|s(w) t )

πref(ˆat|s(w) t ) , t [Tw]. (28)

The decomposition holds in the dispreferred states such that

r(s(l) t , ˇat) = β log πθ(ˇat|s(l) t ) β log πref(ˇat|s(l) t ) = β log πθ(ˇat|s(l) t )

πref(ˇat|s(l) t ) , t [Tl]. (29)

where the dispreferred state-action trajectory along with the token-level MDP, i.e., {(s(l) t , a(l) t )}Tl t=1 is constructed by

Language Models as Implicit Tree Search

s(l) t = (x, y(l) <t) and a(l) t = y(l) t , t [Tl]. Given this, we have

Uss(x, y(w), y(l)) =R(x, y(w)) R(x, y(l))

= R(x, y(w))

t=1 r(s(w) t , ˆat) + Tw X

t=1 r(s(w) t , ˆat)

t=1 r(s(l) t , ˇat) + Tl X

t=1 r(s(l) t , ˇat) R(x, y(l))

t=1 λN(s(w) t ) πθ(a(w) t |s(w) t )

πφ(a(w) t |s(w) t ) πθ(ˆat|s(w) t )

πφ(ˆat|s(w) t )

| {z } µw(φ;θ)

t=1 λN(s(l) t ) πθ(a(l) t |s(l) t )

πφ(a(l) t |s(l) t ) πθ(ˇat|s(l) t )

πφ(ˇat|s(l) t )

| {z } µl(φ;θ)

t=1 β log πθ(ˆat|s(w) t )

πref(ˆat|s(w) t )

t=1 β log πθ(ˇat|s(l) t )

πref(ˇat|s(l) t ) | {z } δθ (30) The theorem has been proved.

B.5. Proof of Theorem.4.3

Here we provide the formal proofs to the sentence-level version of Lemma.4.1 and Theorem.3.3. They are based on the definition of Sentence-level MDP:

Definition B.3. (Sentence-level MDP) Suppose that M = (S, A, P, R, T ) denotes the token-level MDP. It identifies the sentence-level MDP MS =(SS,AS,PS,RS,TS) as

n=1 Sn; AS = AMCTS =

n=1 An; PS : SS AS SS; RS : SS AS R; TS = T . (31)

where Sn indicates the sequence composed of n sentences ended by \n in order and obviously, Sn S. An is the sentence-level action space with n-length tokens and N indicates the maximum length of sentences. The transition kernel PS and reward space RS holds PS S AS S and RS S AS R. To this, if we consider the policies hold the decomposition πS(At|st) = QNt i=1 π(a(i) t |s(i) t ) (At = (a(i) t )Nt i=1 AS denotes the sentence-level action, Nt indicates the token number in the selected sentence action, a(i) t indicates the i-th token in the t-th sentence, and s(i) = (s, a(1)) t , , a(i 1)) t .) RS S AS R, to this, we define r S(Ai, s) RS such that r S(st, Ai) = PNt i r(s(i) t , a(i) t ).

Lemma B.4. Given the prompt x with its preference response pair y(w), y(l) composed of T S w, T S l sentences, A(w) t , A(l) t denotes the t-th sentence in y(w), y(l), respectively. Suppose the state s(w) t ,s(l) t transmits along sentence-level MDP, a(w,i) t /a(l,i) t indicates the i-th token in the A(w) t /A(l) t , s(w,i) t /s(l,i) t denotes the sequential context ahead of a(w,i) t /a(l,i) t in y(w)/y(l). Suppose that the sentence-level policies πS ref, πS θ , πS φ are identified by LM-based token-selection policies πref,

πθ, πφ, respectively; ˆAt = (ˆa(1) t , , ˆa( ˆ Nt) t ) πS φ( |s(w) t ),ˆs(0) t = s(w) t ,ˆs(i+1) t = P(ˆs(i) t , ˆa(i) t ); ˇAt = (ˇa(1) t , , ˇa( ˇ Nt) t )

πS φ( |s(l) t ),ˇs(0) t = s(l) t ,ˇs(i+1) t = P(ˇs(i) t , ˇa(i) t ). It holds

R(x, y(w)) =

i=1 r(ˆs(i) t , ˆa(i) t ) λN(s(w) t ) |A(w) t | Y

πθ(a(w,i) t |s(w,i) t )

πφ(a(w,i) t |s(w,i) t )

πθ(ˆa(i) t |ˆs(i) t )

πφ(ˆa(i) t |ˆs(i) t )

R(x, y(l)) =

i=1 r(ˇs(i) t , ˇa(i) t ) λN(s(l) t ) |A(l) t | Y

πθ(a(l,i) t |s(l,i) t )

πφ(a(l,i) t |s(l,i) t )

πθ(ˇa(i) t |ˇs(i) t )

πφ(ˇa(i) t |ˇs(i) t )

Proof. Suppose that QπS θ (st, ) denotes the state-action value function on the sentence-level policy πS θ with respect to LM θ. Due to πS θ defined with the action space AS = AMCTS, it results in

πS φ(At|st) = λN(st) πS θ (At|st) α(st) QπS θ (st, At) r S(st, At) = α(st) V πS θ (st) λN(st)πS θ (At|st) πS φ(At|st), (33)

Language Models as Implicit Tree Search

where At = (a(i) t )Nt i=1 AS denotes the sentence-level action, Nt indicates the token number in the selected sentence action, and a(i) t indicates the i-th token in the t-th sentence. It holds QπS θ (st, At) = r S(st, At) + V πθ(st) = PNt i r(s(i) t , a(i) t ) + V πS θ (st).

Provided (x, y(w)) denoting a pair of prompt and its preferred response, we construct the preferred state-action trajectory along with the sentence-level MDP. In particular, suppose T S w denotes the number of sentences included in the preferred response y(w), which can be uniquely identified through the location of \n tokens. We denote them as a sequence of

sentence-level actions {A(w) t }T S w t=1. t T S w, and make the decomposition such that {(s(w) t , A(w) t )}T S w t=1 with respect to

A(w) t =(y(w) Pt 1 j=1 N (w) j +1, , y(w) Pt 1 j=1 N (w) j +N (w) t )

=(a(w,1) t , , a(w,N (w) t ) t )

s(w) t =(x, A(w) 1 , , A(w) t 1),

Then for each (s(w) t , A(w) t ) s(w,i) t = (s(w) t , a(w,1) t , , a(w,i 1) t ), i [N (w) t ]. (35)

where a(w,i) t denotes the response token y(w) Pt 1 j=1 N(w) j +i in y(w).

Then we provide the derivation

R(x, y(w)) =

i r(s(i) t , a(i) t ) =

t=1 r S(s(w) t , At)

α(s(w) t ) V πS θ (s(w) t ) λN(s(w) t )πS θ (A(w) t |s(w) t )

πS φ(A(w) t |s(w) t )

α(s(w) t ) V πθ(s(w) t ) λN(s(w) t ) QN(w) t i=1 πθ(a(w,i) t |s(w,i) t ) QN(w) t i=1 πφ(a(w,i) t |s(w,i) t )

α(s(w) t ) V πθ(s(w) t ) λN(s(w) t )

πθ(a(w,i) t |s(w,i) t )

πφ(a(w,i) t |s(w,i) t )

Beyond this, for the t-th state in the preferred state-action trajectory, we sample another action ˆAt from the parameterized

ITS exploration policy πS φ i.e., ˆAt = (ˆa1 t, , ˆa( ˆ Nt) t ) πS φ( |s(w) t ). It can be telescoped into {ˆa(i) t πφ( |ˆs(i) t )} ˆ Nt i=1 with

respect to ˆs(i) t = (x, A(w) 1 , , A(w) t 1, ˆa(1) t , , ˆa(i 1) t ). Therefore t [T S w], it holds

r S(s(w) t , ˆAt) =

i=1 r(ˆa(i) t , ˆs(i) t ) =α(s(w) t ) V πS θ (s(w) t ) λN(s(w) t )πS θ ( ˆAt|s(w) t )

πS φ( ˆAt|s(w) t )

=α(s(w) t ) V πS θ (s(w) t ) λN(s(w) t ) Q ˆ Nt i=1 πθ(ˆa(i) t |ˆs(i) t ) Q ˆ Nt i=1 πφ(ˆa(i) t |ˆs(i) t )

=α(s(w) t ) V πS θ (s(w) t ) λN(s(w) t )

πθ(ˆa(i) t |ˆs(i) t )

πφ(ˆa(i) t |ˆs(i) t )

Derived from the similar routine in Eq.26, we combine Eq.36, 37 to obtain

R(x, y(w)) =

i=1 r(ˆs(i) t , ˆa(i) t ) λN(s(w) t ) N(w) t Y

πθ(a(w,i) t |s(w,i) t )

πφ(a(w,i) t |s(w,i) t )

πθ(ˆa(i) t |ˆs(i) t )

πφ(ˆa(i) t |ˆs(i) t )

Language Models as Implicit Tree Search

For the dispreferred response y(l), it refers to the mirror notation of (w) (l) to construct the dispreferred state-action

trajectory {(s(l) t , A(l) t )}T S l t=1 and for each (s(l) t , A(l) t ), a(l,i) t = y(w) Pt 1 j=1 N(l) j +i and s(l,i) t = (s(l) t , a(l,1) t , , a(l,i 1) t ), i

[N (l) t ] in y(l). Given this, for the t-th state in the dispreferred state-action trajectory, instead of A(l) t , we sample another sentence action ˇAt from the parameterized ITS exploration policy πS φ i.e., ˇAt = (ˆa1 t, , ˇa( ˇ Nt) t ) πS φ( |s(l) t ). It can

be telescoped into {ˇa(i) t πφ( |ˇs(i) t )} ˇ Nt i=1 with respect to ˇs(i) t = (x, A(l) 1 , , A(l) t 1, ˇa(1) t , , ˇa(i 1) t ). From the same deduction routine, it holds

r S(s(l) t , ˇAt) = α(s(l) t ) V πS θ (s(l) t ) λN(s(l) t )

πθ(ˇa(i) t |ˇs(i) t )

πφ(ˇa(i) t |ˇs(i) t ) , t [T S l ] (39)

R(x, y(l)) =

i=1 r(ˇs(i) t , ˇa(i) t ) λN(s(l) t ) N (l) t Y

πθ(a(l,i) t |s(l,i) t )

πφ(a(l,i) t |s(l,i) t )

πθ(ˇa(i) t |ˇs(i) t )

πφ(ˇa(i) t |ˇs(i) t )

On top of Lemma.B.4, we prove Theorem 4.3 as follows

Proof. According to Lemm.B.2, we have

r(ˆs(i) t , ˆa(i) t ) = β log πθ(ˆa(i) t |ˆs(i) t )

πref(ˆa(i) t |ˆs(i) t ) , s.t. t T S w, i ˆNt;

r(ˇs(i) t , ˇa(i) t ) = β log πθ(ˇa(i) t |ˇs(i) t )

πref(ˇa(i) t |ˇs(i) t ) , s.t. t T S l , i ˇNt.

Usa(x, y(w), y(l)) = R(x, y(w)) R(x, y(l))

= R(x, y(w))

i=1 r(ˆs(i) t , ˆa(i) t ) + T S w X

i=1 r(ˆs(i) t , ˆa(i) t )

i=1 r(ˇs(i) t , ˇa(i) t ) + T S l X

i=1 r(ˇs(i) t , ˇa(i) t ) R(x, y(l))

= R(x, y(w))

i=1 r(ˆs(i) t , ˆa(i) t ) + R(x, y(l))

i=1 r(ˇs(i) t , ˇa(i) t )

+ β T S w X

i=1 log πθ(ˆa(i) t |ˆs(i) t )

πref(ˆa(i) t |ˆs(i) t )

i=1 log πθ(ˇa(i) t |ˇs(i) t )

πref(ˇa(i) t |ˇs(i) t )

t=1 λN(s(w) t ) N(w) t Y

πθ(a(w,i) t |s(w,i) t )

πφ(a(w,i) t |s(w,i) t )

πθ(ˆa(i) t |ˆs(i) t )

πφ(ˆa(i) t |ˆs(i) t )

| {z } µS w(φ,θ)

t=1 λN(s(l) t ) N(l) t Y

πθ(a(l,i) t |s(l,i) t )

πφ(a(l,i) t |s(l,i) t )

πθ(ˇa(i) t |ˇs(i) t )

πφ(ˇa(i) t |ˇs(i) t )

| {z } µS l (φ,θ)

+ β T S w X

i=1 log πθ(ˆa(i) t |ˆs(i) t )

πref(ˆa(i) t |ˆs(i) t )

i=1 log πθ(ˇa(i) t |ˇs(i) t )

πref(ˇa(i) t |ˇs(i) t )

| {z } δS(θ) (42) Set up N (w) i = |A(w) t | and N (l) i = |A(l) t |, then the theorem has been proved.

Language Models as Implicit Tree Search

Algorithm 1 The algorithm pipeline of IT-PO

Input: preference dataset D; pre-trained LLMs θ, φ. Hyper-parameters: ϵ; K; the batch size M to construct the batch of c DPO; the number of alternative training nalter. Output: LLMs θ , φ . Initialize LLMs θ, φ by SF with the prompt-response pairs drawn from D, Nalter = 0; Minimize Lc DPO(θ) to post-train LLM θ with the prompt and its pairwise responses drawn from the dataset D; repeat

Implicit-Tree Preference Optimization: repeat

Construct the training batch by M triplets drawn from D; For each triplet (x, y(w), y(l)), construct {(a(w) t , s(w) t )}Tw t=1, {(a(l) t , s(l) t )}Tl t=1 based on the rule of token-level MDP (w.r.t. Lss-IT-PO(φ; θ) for LM alignment) or sentence-level MDP (w.r.t. Las-IT-PO(φ; θ) for LM Reasoning); Set Kw = K

Tw , t Tw (parallel), draw Kw tokens from πφ( |s(w) t ) to construct the samples of token-level action

random variable ˆat, or draw Kw sentences from πS φ( |s(w) t ) to construct the samples of the sentence-level action random variable ˆAt; Set Kl = K

Tl , t Tl (parallel), draw Kl tokens from πφ( |s(l) t ) to construct the samples of token-level action

random variable ˇat, or draw Kl sentences from πS φ( |s(l) t ) to construct the samples of the sentence-level action random variable ˇAt; Fix θ, minimize Lss-IT-PO(φ; θ) to update φ for LM alignment or minimize Las-IT-PO(φ; θ) to update φ for LM reasoning; until The number of training epoches reach the same of Lc DPO(θ). Preference Policy Augmentation: For each prompt x in D, generate K responses {yk}K i=1 using πφ; For each prompt x in D and i = j [K], calculate Uss(x, yi, yj) or Uas(x, yi, yj); Build the token-level policy augmentation pool D+ ss = {(x, yi, yj)| i = j [K], Uss(x, yi, yj) > 0, (x, yi, yj) top K({Uss(x, yi, yj)}i =j [K]), or build the sentence-level policy augmentation pool D+ as = {(x, yi, yj)| i = j [K], Uas(x, yi, yj) > 0, (x, yi, yj) top K({Uas(x, yi, yj)}i =j [K]) Minimize Lc DPO(θ) to post-train LLM θ with the prompt and its pairwise responses drawn from D D+ ss or D D+ as; Nalter = Nalter + 1; until Nalter = nalter θ θ, φ φ.

C. Implementation

The algorithm pipeline of IT-PO are presented in Algorithm.1. The algorithm implementation almost holds the consistency with its theoretical foundation except for two details:(1) the specification of λN(st); (2) the φ s gradient implementation in Lss-IT-PO(φ; θ) and Las-IT-PO(φ; θ).

Suppose that λN(s(w) t ), λN(s(l) t ) denote the number of reasoning paths generated from the state st, which consists of a prompt x and its responses y(w), y(l) with their contents in the previous t 1 steps. For each preference pair, we set t {1, . . . , Tw}, λN(s(w) t ) = Kw and t {1, . . . , Tl}, λN(s(l) t ) = Kl, where Kw, Kl denote how many search start from the t-th leaf nodes s(w) t , s(l) t , respectively. Notice that Tw, Tl change with respect to the preference pair. So IT-PO adaptively configure Kw = K

Tw , Kl = K

Tl to balance the optimization with different response lengths in y(w), y(l) for each pair. We set K = 8, inspired from the number of sampled responses for each prompt in many RLHF implementations.

In φ s gradient analysis in Lss-IT-PO(φ; θ), the gradient of φ consists of terms in the form πθ(at|st)

πφ(at|st) log[ π(at|st)]. Due to

πθ(at|st) πφ(at|st) (0, + ), updating the models with πθ(at|st)

πφ(at|st) log[ π(at|st)] may suffer from exploding/vanishing gradients.

To this, we used their logarithmic scaling πθ(at|st)

πφ(at|st) exp(log(πθ(at|st)) log(πφ(at|st))) to ensure the less sensitive

Language Models as Implicit Tree Search

Table 5. Task setups. The node, tree max width, and tree max depth are search space parameters. The max tree-width and tree-depth follow the empirical experience in (Feng et al., 2023).

Task Task Category Train/test size Node Tree Max width Tree Max depth

GSM8k Mathematical Reasoning 7.5k / 1.3k Sentence 6 8

Game24 Mathematical Planning 1.0k / 0.3k Sentence 20 4

MATH Mathematical Reasoning - / - Sentence 6 8

Table 6. Performance Comparison on Proof Writer and Chess Endgame Tasks

Setting Baselines Proof Writer (Acc %) Chess Endgame (Win rate %)

Path@1 Co T-greedy 37.72 58.14

BFS-V 48.94 67.75

MCTS-α 66.71 96.90

MCTS-rollout 69.23 98.76

ITS-α (ours) 71.77 99.21

ITS-rollout (ours) 75.31 99.83

Equal-Token Co T-SC-MAJ 36.50 9.84

Co T-SC-MAJ 36.58 73.80

BFS-V-ORM 63.42 93.18

MCTS-ORM 60.86 94.26

ITS-α (ours) 74.26 96.48

ITS-rollout (ours) 78.15 98.57

update ratio. In φ s gradient analysis in Las-IT-PO(φ; θ), it can derive the similar gradient as

φµS w(φ, θ) =

t=1 λN(s(w) t ) |A(w) t | Y

πθ(a(w,i) t |s(w,i) t )

πφ(a(w,i) t |s(w,i) t )

|A(w) t | X

j=1 log[πφ(a(w,j) t |s(w,j) t )]

πθ(ˆa(i) t |ˆs(i) t )

πφ(ˆa(i) t |ˆs(i) t )

j=1 log[πφ(ˆa(j) t |ˆs(j) t )] ;

φµS l (φ, θ) =

t=1 λN(s(l) t ) |A(l) t | Y

πθ(a(l,i) t |s(l,i) t )

πφ(a(l,i) t |s(l,i) t )

|A(l) t | X

j=1 log[πφ(a(l,j) t |s(l,j) t )]

πθ(ˇa(i) t |ˇs(i) t )

πφ(ˇa(i) t |ˇs(i) t )

j=1 log[πφ(ˇa(j) t |ˇs(j) t )] ,

where we also took the logarithmic scaling for the gradients in their implementation, e.g.,

|A(w) t | Y

πθ(a(w,i) t |s(w,i) t )

πφ(a(w,i) t |s(w,i) t )

|A(w) t | Y

i=1 exp log(πθ(a(w,i) t |s(w,i) t )) log(πφ(a(w,i) t |s(w,i) t )) (44)

and so do others.

To train our policy networks in GSM8K, Game24, and MATH, we propose the preference data reconfigured from their training set. For each question x with correct response y, we compute log πref(y |x) across all responses y from another

Language Models as Implicit Tree Search

question to generate the hard negative preference pairs, then take them to initialize πθ via c DPO. We set ϵ=0 to human alignment task, yet set ϵ=0.25 to mathematical reasoning and planning tasks. As for tree-based decoding, we employ the same tree-width pruning strategy in (Feng et al., 2024), difference only rises from the deterministic decision making or stochastic decision making. In GSM8K, we also provided the component analysis by varying ϵ in the range {100%, 75%, 50%, 25%, 0%}, along with the results ranged from 53.2, 52.9, 53.4, 51.8, 50.1, accordingly. The performance drastically drops when the ratio less than 50%.

D. LLM-based Reasoning Experiments Beyond Mathematics

We offer evaluation based on Proof Writer (Tafjord et al., 2020) for deductive logical reasoning, and Chess Endgame (Abdulhai et al., 2023) for long-term multi-turn decision making. For Proof Writer, we follow (Pan et al.) to generate the test set, then the rest are merged to 41,433 training instances. All training and test instances employed the prompt template in (Pan et al.) that initiated the start of Co T, then we employ LLAMA2-7B as the base model and all fine-tuning methods only run for a single epoch. For Chess Endgame, we follow the experimental setup in (Feng et al., 2023). With regards to each prompt-response pair (x, y(w)), in Proof Writer and Chess Endgame, we find the dispreferred response y(l)

using the same strategy in our mathematical reasoning tasks. We ensure the evaluation in the fair comparison with the Co T and LLM-based tree-search baselines : Co T-greedy, BFS-V, MCTStr, MCTS-rollout, Co T-SC-MAJ, Co T-SC-ORM, BFS-V-ORM, MCTS-ORM, whose implementations are consistent in the paper.

For simplicity, we skip the average token number metric to highlight Acc in Proof Writer and Win rate in Chess Endgame. While their results remain based on Path@1 to promise the computation efficiency, and Equal-Token to encourage the comparison in the similar scale of computation consumption cross baselines. In the table, we found that Co T variants almost fail in Proof Writer due to their performances close to random guess (33.33%). MCTS variants obtain significantly better results yet basically under-perform ITS variants with substantial gap in ACC, probably due to the cold-start effect in MCTS learned with one epoch. As for Chess Endgame, ITS variants almost solve the problems with Win rates 99.83% in Path@1 and 98.57% in Equal-Token. It proves ITS also competitive in long-horizon reasoning.