Language Models as Implicit Tree Search

Ziliang Chen 1, Zhao-Rong Lai 2, Yufeng Yang 3, Liangda Fang 2, Zhanfu Yang 4, Liang Lin 1 3

Abstract

Despite advancing language model (LM) alignment, direct preference optimization (DPO) falls short in LM reasoning, even with its free lunch of bypassing reinforcement learning (RL). As a breakthrough, this work proposes a new RL-free preference optimization method that achieves DPO while learning another LM whose response generation policy holds asymptotic equivalence with Alpha Zero-like search, the apex of algorithms for complex reasoning missions such as chess and Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by the extra LM still equips DPO with a reasoning procedure technically akin to Alpha Zero. Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment and MCTS-based LMs in mathematical reasoning and planning tasks.

1. Introduction

Preference optimization paradigms, notably Reinforcement Learning from Human Feedback (RLHF) (Bai et al., 2022) and Direct Preference Optimization (DPO) (Rafailov et al., 2024b), are foundational to modern Language Model (LM) alignment, aiming to imbue these models with humans' behavioral characteristics, inclinations, and values. In particular, DPO circumvents reward modeling so as to directly optimize LMs with preferential response pairs, thereby avoiding RL's complexity and instability to yield more efficiently and stably aligned models.
Distinct from preference alignment, the sophisticated reasoning capabilities underlying human brains are not typically replicated by RLHF and DPO alone: their emulation often necessitates synergistic integration with other techniques, e.g., meticulous prompt engineering (Wei et al., 2022), process-based reward modeling (Lightman et al., 2023), and Monte Carlo Tree Search (MCTS) (Xie et al., 2024a). Among these complementary approaches, MCTS stands out as especially promising, due to its exploration of complex decision spaces in high-level reasoning tasks such as the game of Go (Silver et al., 2017b).

1 Research Institute of Multiple Agents and Embodied Intelligence, Peng Cheng Laboratory; 2 Jinan University, Guangzhou, China; 3 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; 4 Department of Computer Science, Rutgers University. Correspondence to: Liang Lin. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Much leading research has been proposed to combine MCTS with preference optimization, whereas their MCTS procedures inevitably require a value function learned by RL to execute the Upper Confidence Bound (UCB) search principle. It is noteworthy that the critical advantage of DPO is to skip value learning, achieving faster and more stable preference optimization than RLHF. Standard MCTS's reliance on an RL-learned value function directly conflicts with DPO's core strength of bypassing RL entirely, presenting a significant hurdle to their seamless integration. This implies the necessity of reconciling MCTS methodologies with the RL-free nature of DPO. Rather than regular MCTS, this paper starts from its theoretical variant derived in (Grill et al., 2020b), where Alpha Zero-like MCTS can be treated as a stochastic policy solved by a state-specific local optimization regularized by the reverse KL-divergence.
In this regard, each state-specific local tree-search decision asymptotically converges to the Alpha Zero-like MCTS decision with a gap bounded by the empirical visit counts, thus lifting the expressiveness of MCTS beyond the limits of integer-count probabilities and sparse tree width. Regretfully, the local solution of this stochastic policy involves a parameter only solvable by dichotomic search per state. Given the exponentially complex state space of token-level language generation, this implicit tree search (ITS) has hardly been applied in MCTS-based LM research.

Contributions. We propose a new RL-free preference optimization paradigm aiming to approximate ITS by LMs. Specifically, beyond the original token-selection policy in DPO, we propose another LM to learn the global policy of ITS by virtue of neural universal approximation. The preference optimization, built on top of DPO without dichotomic search or any other RL elements, inherits the critical advantages of the DPO family. Through gradient analysis of the preference optimization with ITS, learning its global policy yields response generation with diverse and better-aligned preferences, thereby simultaneously benefiting AI alignment and reasoning tasks. Self-improved preference policy augmentation and decoding strategies are proposed for our ITS-based preference optimization approach, in order to align with MCTS-based LLM decoding. The connection between the decoding strategies and group-relative policy optimization (Shao et al., 2024) is also presented. Our experiments include evaluations across human preference alignment, mathematical reasoning, and mathematical planning, where our approach concurrently attains the best results against both DPO variants and MCTS-derived baselines.

2. Preliminaries

For language generation, a language model (LM) θ serves as a token-level Markov decision process (MDP) (S, A, P, R, T), where the state s_t = (x, y_{<t}) concatenates the prompt x with the tokens generated so far, the action a_t = y_t selects the next token, P denotes the deterministic transition s_{t+1} = (x, y_{<t+1}), R provides the reward, and T is the horizon. Given a preference dataset D of prompts x with chosen and rejected responses (y^(w), y^(l)), RLHF fits a reward model R(x, y) and then optimizes the KL-regularized objective

    max_{π_θ} E_{x~D, y~π_θ(·|x)} [R(x, y)] − β D_KL[π_θ(·|x) || π_ref(·|x)],    (2)

where π_ref is the supervised fine-tuned reference policy and β > 0 controls their trade-off. DPO (Rafailov et al., 2024b) reconsiders the optimization from an interpretation of R(x, y) represented as

    R(x, y) = β log [π_θ(y|x) / π_ref(y|x)] + β log Z(x).    (3)

With this regard, Eq. 2 is equivalent to

    L_DPO(π_θ; π_ref) = −E_{(x, y^(w), y^(l))~D} [log σ(β log [π_θ(y^(w)|x) / π_ref(y^(w)|x)] − β log [π_θ(y^(l)|x) / π_ref(y^(l)|x)])],    (4)

therefore the reward modeling phase is skipped to directly optimize the token-selection policy π_θ.

Token-level Interpretation of LM Alignments. The aforementioned LM alignments focus on a complete response y, yet π_θ is executed at each MDP step using token-level rewards r(s_t, a_t), with β log [π_θ(y_t | [x, y_{<t}]) / π_ref(y_t | [x, y_{<t}])] serving as DPO's implicit token-level reward (Rafailov et al., 2024a).

Alpha Zero-like Tree Search. Given a policy π with state-action value Q^π(s, a) and the visit count n(s, a) of each state-action pair, Alpha Zero-like MCTS (Silver et al., 2017b) selects actions by

    a*(s) = argmax_{a ∈ A_MCTS} [Q^π(s, a) + c · π(a|s) · √(Σ_{a' ∈ A_MCTS} n(s, a')) / (1 + n(s, a))],    (7)

where c > 0 is a balance factor between exploration and exploitation in the MCTS search space A_MCTS. The four-stage procedure of Eq. 7 is shown in Appendix A.

3. Implicit Tree Search: Warm-Up

Alpha Zero-like MCTS variants have been intensively explored in leading research on LM training (Feng et al., 2023; Zhang et al., 2024a) and alignment (Xie et al., 2024b; Chen et al., 2024a), in order to mitigate downsides in reasoning and long-form generation ability found in post-training. In spite of different algorithm designs, the methodologies are consistent with MCTS in the data-generation manner: π̂ merges with the decoding strategy to generate responses for training-data augmentation, then the token-selection policy π_θ and the visit counts are updated along with these data, further renewing the decoding by Eq. 7. Despite significantly advancing LM alignments, this decision-making pipeline relies on the online visit count n(s, a) per state-action pair. Given the integer-count probabilities and the sparse tree width in the initial training iterations, it inevitably suffers from two critical problems.

Cold-start exploration expressiveness.
The term √(Σ_{a' ∈ A_MCTS} n(s, a')) / (1 + n(s, a)) is the key to differentiating π̂ and π, so Alpha Zero-like tree search typically employs the empirical visit distribution

    π̂(a|s) = (1 + n(s, a)) / (|A_MCTS| + Σ_{a' ∈ A_MCTS} n(s, a')),

or its exponential generalization (1 + n(s, a))^{1/r} / Σ_{a' ∈ A_MCTS} (1 + n(s, a'))^{1/r}, to explore the tree structure. However, the discrete nature of visit counts limits the expressiveness of the exploration strategy.

Proposition 3.1. Given a fixed r ∈ R\{0}, the set of values attainable by the exponential empirical visit distribution has the same cardinality as Z.

Proposition 3.2. Given a learnable r ∈ R, the action selection probability under the exponential empirical visit distribution takes values in the range (1/|A_MCTS|, 1).

The problem is cold-start because early decisions heavily influence the tree growth, yet ratios of integers are most unstable when the visit counts are small. This may drive language generation into bias or sub-optima.

Sparse-action improvement. π̂ is solely improved for actions searched at least once in previous simulations, whereas actions with n(s, a) = 0 remain dominated by θ. Given the large action space of language and the deterministic policy in Eq. 7, a large simulation budget may be required to improve actions never visited.

Recognizing that these limitations are intrinsic to the use of online visit counts in conventional MCTS algorithms, we supersede the online tree-search strategy with the stochastic policy optimization established in (Grill et al., 2020b) (depicted in Figure 1(b)), named Implicit Tree Search since the optimized policy is provably equivalent to the Alpha Zero-like tree-search exploration defined in Eq. 7.

3.1. MCTS as Regularized Policy Optimization

More specifically, for any state, the action selection formula in Eq. 7 holds the identical interpretation

    a*(s) = argmax_{a ∈ A_MCTS} [Q^π(s, a) + λ_N(s) · π(a|s) / π̂(a|s)],    (8)

where π̂(a|s) = (1 + n(s, a)) / (|A_MCTS| + Σ_{a' ∈ A_MCTS} n(s, a')) represents the empirical visit distribution and λ_N(s) = c · √(Σ_{a' ∈ A_MCTS} n(s, a')) / (|A_MCTS| + Σ_{a' ∈ A_MCTS} n(s, a')) denotes the state-specific multiplier. Notice that for any s ∈ S, the empirical visit distribution π̂(·|s) is the only way the search algorithm influences the optimal action of tree search. In the language generation context we set π = π_θ, so for any s ∈ S, π̂(·|s) holds a corresponding predominant policy π̄(·|s) as follows.

Theorem 3.3. (Asymptotic equivalence between ITS and MCTS policies) ∀s ∈ S, let π̄(·|s) be the solution of

    π̄(·|s) = argmax_{y(s) ∈ Δ} [Q^{π_θ}(s)ᵀ y(s) − λ_N(s) D_KL[π_θ(·|s), y(s)]],    (9)

where Δ denotes the |A_MCTS|-dimensional simplex and Q^{π_θ}(s) = (Q^{π_θ}(s, a_1), Q^{π_θ}(s, a_2), ..., Q^{π_θ}(s, a_{|A_MCTS|})) is the Q-value vector with respect to the state s. Then, as the visit counts increase, the empirical visit distribution π̂(·|s) in Eq. 8 converges to π̄(·|s) with the upper bound

    ||π̂(·|s) − π̄(·|s)|| ≤ (|A_MCTS| + 1) / (|A_MCTS| + N),    (10)

where N indicates the total rounds of simulation.

Theorem 3.3 is concluded from Definition 1 and Propositions 1 and 5 in (Grill et al., 2020b). The theoretical result verifies that, given sufficient simulation steps, there exists an asymptotic equivalence between the exploration of Alpha Zero-like tree search π̂(·|s) and the stochastic-sampling policy π̄(·|s).

Figure 1. Comparison between the diagrams of (a) MCTS (regular MCTS and Alpha Zero-like tree search), (b) Implicit Tree Search (ITS) (Grill et al., 2020b), and our IT-PO algorithm.

Note that the policy optimization in a simplex allows continuous value changes in π̄(·|s), which prevents the risk of cold-start expressiveness. Besides, the reverse KL divergence is smooth in π̄(·|s), ensuring the sparse-action improvement is resolved.
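The optimization of Eq. 9 admits a per-state quasi-closed form (Lemma 3.4 below), π̄(a|s) = λ_N(s)·π_θ(a|s)/(α(s) − Q^{π_θ}(s, a)), where α(s) must be located by a one-dimensional dichotomic search. Below is a minimal sketch of that per-state solver, with illustrative names and toy values rather than the paper's implementation:

```python
def its_policy(q, pi_theta, lam, iters=100):
    """Solve one state of Eq. 9: pi_bar(a) = lam * pi_theta(a) / (alpha - Q(a)),
    with alpha found by dichotomic (bisection) search so the entries sum to one.
    q: list of Q-values, pi_theta: prior probabilities, lam: lambda_N(s)."""
    # Bracketing interval for alpha (conditions 2) and 3) of Lemma 3.4)
    alpha_min = max(qa + lam * pa for qa, pa in zip(q, pi_theta))
    alpha_max = max(q) + lam

    def mass(alpha):
        return sum(lam * pa / (alpha - qa) for qa, pa in zip(q, pi_theta))

    lo, hi = alpha_min, alpha_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # mass(alpha) decreases monotonically in alpha on this interval
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return [lam * pa / (alpha - qa) for qa, pa in zip(q, pi_theta)]

# Toy state with three actions: higher Q and higher prior both raise pi_bar.
pi_bar = its_policy(q=[1.0, 0.5, 0.0], pi_theta=[0.5, 0.3, 0.2], lam=1.0)
```

Unlike the count-based π̂, this π̄ varies continuously with Q and π_θ, which is the expressiveness point made above; the paper's contribution is to replace this per-state root search with a learned LM.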
Accordingly, the expand and backup stages of the MCTS procedure require the state-specific stochastic policy optimization formulated as follows.

Lemma 3.4. (Solution of the ITS policy (Grill et al., 2020b)) ∀s_t ∈ S, the solution π̄(·|s_t) of Eq. 9 satisfies, ∀a_t ∈ A_MCTS,

    π̄(a_t|s_t) = λ_N(s_t) · π_θ(a_t|s_t) / (α(s_t) − Q^{π_θ}(s_t, a_t)),    (11)

where α(s_t) is defined by

    1) α(s_t) is the value α ∈ R such that Σ_{a ∈ A_MCTS} π̄(a|s_t) = 1;
    2) α(s_t) ≥ α(s_t)_min := max_{a ∈ A_MCTS} [Q^{π_θ}(s_t, a) + λ_N(s_t) π_θ(a|s_t)];
    3) α(s_t) ≤ α(s_t)_max := max_{a ∈ A_MCTS} Q^{π_θ}(s_t, a) + λ_N(s_t).    (12)

Lemma 3.4 is derived from Appendix B.3 of that paper.

3.2. Challenges of Combining Implicit Tree Search and LMs

Observing that Σ_{a ∈ A_MCTS} π̄(a|s_t) monotonically decreases on (α(s_t)_min, α(s_t)_max), the theoretical result guarantees that the state-specific hyper-parameter α(s_t) is uniquely identified by dichotomic search over (α(s_t)_min, α(s_t)_max). In other words, π̄ cannot be universally optimized by gradient descent, and even for each state, the solution of π̄(·|s_t) in Eq. 11 is not exactly closed-form. Hence π̄ cannot be used as flexibly as the explicit search of regular MCTS algorithms and is instead only applicable for policy distillation to update the Q function in Alpha Zero-like MCTS. In particular, it is hardly applicable to LLM-based preference optimization. The vital problem is rooted in the exponentially increasing size of the valid state space: as demonstrated in the token-level MDP, texts are generated by sequential token selection with the state s_t = (x, y_{<t}) [...] U_ss(x, y_i, y_j) > 0 at the token level or U_as(x, y_i, y_j) > 0 at the sentence level. After ranking their values by U_ss or U_as, the top-M preference pairs are selected for each prompt x in D. We then collect them across all prompts in D to construct the ITS-improved dataset D+, which joins with D to refine the token-selection policy via learning θ with L̂_DPO. We elaborate the IT-PO algorithm pipelines of the step-synchronous and step-asynchronous cases of our implementation in Appendix C.

Self-Improvement in Gradient Analysis.
Through analyzing the gradients of the IT-PO loss function with respect to φ, we demonstrate how π_φ makes the generated responses both diverse and better aligned with ground-truth preferences. For clarity, our analysis is derived from the step-synchronous ITS using U_ss for preference alignment and focuses on the positive preference logit in log[σ(U_ss)]:

    ∇_φ log[σ(U_ss(x, y^(w), y^(l)))] = (∇_φ μ_w − ∇_φ μ_l) · σ(−R̂),    (19)

where σ(−R̂) is higher when the reward estimate by φ is wrong, and

    ∇_φ μ_w = Σ_t λ_N(s_t^(w)) [ (π_θ(a_t^(w)|s_t^(w)) / π_φ(a_t^(w)|s_t^(w))) ∇_φ log π_φ(a_t^(w)|s_t^(w)) − (π_θ(â_t|s_t^(w)) / π_φ(â_t|s_t^(w))) ∇_φ log π_φ(â_t|s_t^(w)) ],

    ∇_φ μ_l = Σ_t λ_N(s_t^(l)) [ (π_θ(a_t^(l)|s_t^(l)) / π_φ(a_t^(l)|s_t^(l))) ∇_φ log π_φ(a_t^(l)|s_t^(l)) − (π_θ(ǎ_t|s_t^(l)) / π_φ(ǎ_t|s_t^(l))) ∇_φ log π_φ(ǎ_t|s_t^(l)) ].

Compared with DPO, the multiplier σ(−R̂) in the decomposed formula achieves the identical gradient effect. ∇_φ log π_φ(a_t^(w)|s_t^(w)) and ∇_φ log π_φ(a_t^(l)|s_t^(l)) similarly exist in the first terms, yet they take their likelihood-increase and likelihood-decrease effects with coefficients π_θ(a_t^(w)|s_t^(w)) / π_φ(a_t^(w)|s_t^(w)) and π_θ(a_t^(l)|s_t^(l)) / π_φ(a_t^(l)|s_t^(l)). This implies that when π_φ < π_θ on the ground-truth preference pairs, IT-PO exerts an aggressive likelihood influence to align π_φ with π_θ, and a conservative one when π_φ > π_θ. More importantly, the implicit-tree samples branching from the ground-truth preference nodes, i.e., â_t ~ π_φ(·|s_t^(w)) and ǎ_t ~ π_φ(·|s_t^(l)), also influence the likelihood: ∇_φ log π_φ(â_t|s_t^(w)) and ∇_φ log π_φ(ǎ_t|s_t^(l)) show that, from the ground-truth preference nodes, π_φ learns to search lower-preference child nodes from s_t^(w) yet higher-preference child nodes from s_t^(l), with the same conservative-aggressive strategy. Hence π_φ generates confident responses from a preferred context yet explores more diverse responses when the context is not preferred.

ITS-guided Decoding.
Beyond self-enhanced training, a recent study on LLM-based MCTS (Feng et al., 2023) demonstrates promising results in improving the decoding process by MCTS. Motivated by this, we propose stochastic-tree decoding variants in the spirit of their MCTS-α and MCTS-rollout strategies.

1) ITS-α. For each initial state x_root, the original MCTS-α decoding strategy applies Alpha Zero-like tree search for policy evaluation, then backs up the visit counts n(s, a) of the exponential visit distribution n(s_t, a)^{1/γ} / Σ_{a'} n(s_t, a')^{1/γ} to guide decoding. Since π_φ is the universal approximator of π̄, the stochastic variant of Alpha Zero-like tree search, it is straightforward to instead use the exponential version of π_φ, i.e., exp(log[π_φ(a|s_t)]/γ) / Σ_{a'} exp(log[π_φ(a'|s_t)]/γ) (step-synchronous IT-PO) or exp(log[π^S_φ(a|s_t)]/γ) / Σ_{a'} exp(log[π^S_φ(a'|s_t)]/γ) (step-asynchronous IT-PO), in our decoding strategy.

2) ITS-rollout. When the root x_root and the intermediate state nodes sufficiently differ from the states visited in training, the simulation rounds N behave closer to zero, breaking the approximation between π̂ and π̄ in Theorem 3.3. In this case, π_φ needs to be updated to adapt to the state x_root, as in the backup process of MCTS. With no value function available, we employ π_θ to implicitly evaluate arbitrary pairs of responses y_1, y_2 generated by π_φ(·|x); the sampled responses with preferences evaluated by θ then join the meta-update of π_φ for the state x:

    φ ← φ − ∇_φ L_sa-IT-PO(x, y_1 ≻ y_2)  s.t.  y_1 ≻ y_2 if u_θ(x, y_1, y_2) > 0,    (20)

which refreshes exp(log[π_φ'(a|s_t)]/γ) / Σ_{a'} exp(log[π_φ'(a'|s_t)]/γ) to facilitate our decoding strategy. ITS-rollout can be treated as a test-time-training version of ITS-α: its implementation first updates φ by Eq. 20, then uses ITS-α with the updated φ to facilitate the decoding process. In our experiments, 8 responses were generated for each test prompt to achieve self-training, resulting in an extra 0.5 to 1 hour as the decoding warm-up stage.
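The exponential decoding distribution of ITS-α is just a temperature-scaled softmax over log π_φ; a minimal sketch with illustrative names (the log-probabilities are placeholders, not model outputs):

```python
import math

def its_alpha_dist(log_pi_phi, gamma=1.0):
    """ITS-alpha decoding: softmax(log pi_phi(a|s) / gamma), the stochastic-policy
    analogue of MCTS-alpha's n(s,a)^(1/gamma) / sum_a' n(s,a')^(1/gamma)."""
    scaled = [lp / gamma for lp in log_pi_phi]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# gamma = 1 recovers pi_phi itself; gamma < 1 sharpens toward the argmax token.
probs = its_alpha_dist([math.log(0.7), math.log(0.2), math.log(0.1)], gamma=1.0)
sharp = its_alpha_dist([math.log(0.7), math.log(0.2), math.log(0.1)], gamma=0.5)
```

With γ = 1 this reproduces sampling from π_φ directly, matching the setting reported in the alignment experiments.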
Provided the decoding probabilities obtained by ITS-rollout, a BFS (breadth-first search) strategy can be applied based on the implicit advantage function approximated solely by π_φ and π_θ. More specifically,

    A^{π_θ}(s, a) = Q^{π_θ}(s, a) − E_{a'~π_θ(·|s)}[Q^{π_θ}(s, a')]
                  ≈ Q^{π_θ}(s, a) − (1/N) Σ_{a'~π_θ(·|s)} Q^{π_θ}(s, a')
                  = [α(s) − λ_N(s) π_θ(a|s)/π_φ(a|s)] − (1/N) Σ_{a'} [α(s) − λ_N(s) π_θ(a'|s)/π_φ(a'|s)]
                  = λ_N(s) [(1/N) Σ_{a'} π_θ(a'|s)/π_φ(a'|s) − π_θ(a|s)/π_φ(a|s)],    (21)

where N indicates how many responses are drawn from π_θ to approximate the advantage function, and Q^{π_θ} is expressed through Eq. 11 with π̄ approximated by π_φ. Such a decoding strategy is closely related to the wisdom of group-relative policy optimization (GRPO), a well-known value-free RL algorithm employed to train Deepseek (Shao et al., 2024).

Figure 2. Win rate measured by GPT-4 via the prompts consistent with previous studies: (a) win rates of baselines decoding with different temperatures; (b) win rates of π_θ and π_φ across the alternating phases of their preference policy distillation.

5. Experiments

In this section, we demonstrate the superiority of IT-PO from the step-synchronous (Theorem 4.2) and step-asynchronous (Theorem 4.3) perspectives. In the step-synchronous case, IT-PO trains π_φ to perform token-level ITS in order to provide fine-grained human preference alignment; in the step-asynchronous case, we evaluate π_φ trained by IT-PO to perform sentence-level search in mathematical reasoning and planning tasks, where LLM-based MCTS algorithms are extensively used.

5.1. Experiments on Token-level Preference Alignment

The Anthropic HH dataset (Bai et al., 2022)¹ consists of 170k dialogues between a human and an automated assistant, each of which presents a history with alternative responses whose different preferences are annotated by humans.
We conduct the conventional evaluation setup using Pythia-2.8B (Biderman et al., 2023) as the base model; each dialogue with its preferred completion in the Anthropic HH training set is incorporated for supervised fine-tuning to derive three LMs: the parameter-frozen π_ref and the parameter-initialized π_θ, π_φ. For preference alignment on the token-level MDP, SS-IT-PO alternately trains the LMs θ and φ; both IT-PO variants are then compared with token-level DPO-family baselines, i.e., DPO (Rafailov et al., 2024a), RTO (Zhong et al., 2024a), TDPO1 and TDPO2 (Zeng et al., 2024a). The experiment primarily targets three evaluation metrics:

1) Accuracy: we adopt the evaluation split of (Zeng et al., 2024a) to train all models, then evaluate the accuracy of the generated responses relative to the chosen completions in the test dataset;
2) Diversity: nucleus sampling with p = 0.95 generates 25 responses, and the predictive entropy across the responses indicates the generation diversity;
3) Win rate: all baseline approaches are evaluated through GPT-4 against the chosen responses in the test set, so that > 50% implies human preference alignment is achieved.

¹ https://huggingface.co/datasets/Anthropic/hh-rlhf

Both LMs θ and φ utilize ITS-α to enhance the decoding strategy, with the exponential rate γ = 1 kept consistent across the evaluation of the three metrics. This choice aligns with MCTS-α, which employs token-level tree search in alignment tasks.

Table 1. Comparison in terms of the trade-off between Alignment (accuracy) and Diversity (entropy) on the Anthropic HH dataset. Higher values are preferable for both metrics.

    Method            Alignment Acc (%)   Diversity Ent
    DPO               59.43               3.196
    RTO               61.43               3.314
    TDPO1             60.08               4.727
    TDPO2             67.33               4.915
    IT-PO θ (ours)    67.75               4.564
    IT-PO φ (ours)    69.12               5.315

Accuracy and Diversity across the baselines are presented in Table 1.
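The diversity metric above can be sketched as a predictive entropy over the sampled responses; treating each whole response string as one outcome is an assumption about the granularity, not the paper's exact computation:

```python
import math
from collections import Counter

def predictive_entropy(responses):
    """Shannon entropy of the empirical distribution over sampled responses:
    0 when all samples agree, log(k) when k samples are all distinct."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# 25 identical generations -> no diversity; 4 distinct generations -> log(4).
low = predictive_entropy(["same answer"] * 25)
high = predictive_entropy(["a", "b", "c", "d"])
```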
TDPO2 demonstrates competitive performance, achieving alignment accuracy comparable to IT-PO θ and even exhibiting superior diversity. On the other hand, IT-PO φ, trained as the stochastic tree policy, outperforms TDPO2 by a significant margin in both alignment and diversity. This is because LM θ in our framework acts more like a policy evaluator trained to boost the evolution of the stochastic tree policy π_φ. To verify this assumption, we further observe the Win rate illustrated in Figure 2(a). Scaling the temperature during inference, the T-DPOs prove quite sensitive to temperature, with lower performance than most baselines at extreme temperature values. Instead, both IT-PO θ and IT-PO φ consistently perform with robust values over 50% in 7 out of 10 cases where the other baselines underperform or even fail to reach a qualified Win rate. In Figure 2(b), we further observe the Win rate progress of IT-PO θ and IT-PO φ through their alternating policy distillation. When no alternating strategy is used (iter = 1), the win rates of IT-PO θ and IT-PO φ are merely on par with those of DPO and RTO, respectively, which largely underperform the T-DPO variants. After five iterations of alternating preference policy distillation, the Win rates of IT-PO θ and IT-PO φ both improve remarkably, exceeding their early and even their teachers' performances. These observations are consistent with our analysis of φ's gradients.

Table 2. Performance comparison of different methods on the GSM8k and Game24 datasets. Entries report Performance (%) / # Tokens.

    Setting       Method              GSM8k         Game24
    Path@1        CoT-greedy          41.4 / 98     12.7 / 76
                  BFS-V               52.5 / 485    64.8 / 369
                  MCTS-α              51.9 / 561    63.3 / 412
                  MCTS-Rollout        47.8 / 3.4k   71.3 / 670
                  ITS-α (ours)        53.2 / 561    67.6 / 380
                  ITS-Rollout (ours)  51.6 / 3.8k   73.2 / 646
    Equal-Token   CoT-SC-MAJ          46.8 / 500    14.6 / 684
                  CoT-SC-ORM          52.3 / 500    50.6 / 684
                  BFS-V-ORM           -             70.90 / 1.6k
                  MCTS-ORM            -             69.34 / 649
                  ITS-α (ours)        -             70.64 / 1.6k
                  ITS-Rollout (ours)  -             71.42 / 698

5.2. Experiments on Mathematical Tasks

Our second experimental suite includes mathematical reasoning on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), and mathematical planning on Game24. Although these benchmarks are well known for evaluating LLMs' reasoning capability with PRMs (process reward models) (Lightman et al., 2023) and MCTS-based methods (Feng et al., 2023), they are nontrivial for the DPO family, since an explicit value function cannot be skipped when executing MCTS reasoning strategies.

Baselines and Evaluation. Since most DPO-based approaches cannot adapt to LLM reasoning without explicit value modeling, we are more interested in the comparison between IT-PO and the state-of-the-art tree-search (TS) LLM and Chain-of-Thought (CoT) (Wei et al., 2022) baselines of (Feng et al., 2023). More specifically, the evaluated baselines for GSM8K and Game24 are derived from the LLAMA-7b² base model, and include CoT-greedy (greedy value search by CoT), BFS-V (breadth-first search with their learned value function), MCTS-α (Alpha Zero-like tree search with their learned value function), and MCTS-rollout (an MCTS-α variant that allows the backup process to happen at intermediate steps). As a counterpart, we incorporate the policies π_θ and π_φ with their LMs fine-tuned by L̂_DPO (Eq. 13) and the step-asynchronous IT-PO objective (Eq. 17), respectively, then achieve their alternating post-training via preference policy augmentation. After five iterations, the policy models jointly facilitate the decoding processes via the ITS-α and ITS-rollout strategies. All baselines are evaluated in the single-path setup via Path@1 and Equal-Token, the latter of which compares results under a similar scale of computational consumption.

² https://huggingface.co/meta-llama/Llama-2-7b

Table 3. Path@1 metric on Game24 with different node sizes. Entries report Performance (%) / # Tokens.

    Method              width=6       width=20      width=50
    MCTS-α              41.6 / 243    63.3 / 412    74.5 / 573
    MCTS-Rollout        43.8 / 401    71.3 / 670    80.7 / 833
    BFS-V               43.2 / 206    64.8 / 370    74.6 / 528
    ITS-α (ours)        46.1 / 267    67.6 / 380    75.0 / 647
    ITS-Rollout (ours)  48.9 / 489    73.2 / 646    81.8 / 954

For the evaluation on MATH, we construct a training set integrating the training splits of GSM8K and MATH, and consider Qwen1.5-32B (Team, 2024) as our base model. Beyond this, we also introduce LLAMA3-8B as an alternative for φ, to verify whether φ can be replaced by a smaller LLM. To this end, the evaluated LLM pairs (θ, φ) are (Qwen1.5-32B, Qwen1.5-32B) and (Qwen1.5-32B, LLAMA3-8B), respectively. We also employ greedy CoT (3 shots), MCTS-α, and MCTS-rollout for comparison.

Implementation. Distinct from human preference data with pairwise responses, reasoning tasks only consist of question prompts and their correct solution responses, which motivates us to reconfigure the data to fit our preference optimization regime. Beyond this, since our tree-search strategy is in fact a stochastic policy, the search depth and breadth for training LLM φ were not limited on any dataset, while the decoding procedures, i.e., ITS-α and ITS-rollout, inherited the pruning setup of the MCTS baselines to address the heavy computation when the tree search runs for reasoning-oriented inference. Details are given in Appendix C.

Results. According to the GSM8k results (Table 2), when examining Path@1 performance across the baseline methods, tree-search algorithms (excluding our proposed methods) do not show significant advantages over standard CoT approaches; the MCTS variants actually perform worse than BFS-V, despite their higher computational requirements. However, the Game24 results tell a different story.
In this task, CoT approaches perform poorly, largely because Game24's structure allows for wider and deeper search trees, which better showcases the strengths of tree-search algorithms (Table B.5). Interestingly, ITS-α and ITS-rollout consistently outperform the other approaches, regardless of the default tree width and depth limits during inference. This superior performance can be attributed to our IT-PO training regime. Unlike traditional approaches that slowly build sparse trees based on visit counts, our approximated π̄ encourages broader exploration across the entire action space. The approach is also computationally efficient, since IT-PO only needs to sample the next sentence (rather than the complete reasoning path) from preferred/dispreferred contexts (as defined by Â_t and Ǎ_t in Eq. 4.3), enabling faster search and backup operations than standard MCTS. As a result, even with bounded search width and depth during inference, the stochastic policy demonstrates robust performance due to effective training simulation.

Table 4. The experimental results on MATH. All baselines employed Qwen1.5-32B as their base models.

    Baselines / Decoding       greedy CoT   -α      -rollout
    Qwen1.5-32B                36.1         -       -
    MCTS- (Qwen1.5-32B)        -            36.0    36.7
    ITS- (φ=Qwen1.5-32B)       -            39.8    40.2
    ITS- (φ=LLAMA3-8B)         -            37.9    38.2

In order to support our claim, we further ablate the maximum tree width and node size during inference to observe the performance variation across different baselines. As shown in Tables 3 and 4, expanding the node size and tree width significantly boosts regular TS-LLM performance, while ITS-α and ITS-Rollout already achieve impressive results with small node sizes and tree widths. Beyond this, they also enjoy performance growth along with the increasing exploration space. Table 4 presents the evaluation on MATH, which contains more difficult mathematical reasoning tasks.
Against the greedy CoT result of LLAMA3-8B (20.5), some observations can be drawn from the table. First, we note that the MCTS strategy derived from GSM8K is unreliable on MATH, probably due to the conflict between the problem complexity and the limited search width and depth during training, while ITS-α and ITS-rollout did not suffer from this problem. Second, with a weaker model (LLAMA3-8B) implementing LLM φ, IT-PO still enabled the improvement of LLM-based reasoning, which implies the feasibility of IT-PO. Specifically, despite avoiding value modeling, IT-PO introduces the extra LM φ beyond the single LM θ of standard DPO variants, leading to increased computational and memory requirements. Nevertheless, the evidence in Table 4 demonstrates that, using an LM φ with a significantly smaller and weaker base model than LM θ, ITS-(θ = Qwen1.5-32B, φ = LLAMA3-8B) still yields results very competitive with ITS-(θ = Qwen1.5-32B, φ = Qwen1.5-32B).

6. Concluding Remark

This paper presents a new RL-free methodology that equips DPO with MCTS interpreted as a stochastic policy, to better align LMs with human preferences, and simultaneously outperforms MCTS-based LLM reasoning baselines in both mathematical reasoning and planning tasks.

Acknowledgement

The research was supported in part by the Guangdong S&T Programme (Grant No. 2024B0101010003); in part by the Major Key Project of PCL (No. PCL2024A04, PCL2025A02); in part by the National Natural Science Foundation of China (NSFC) under Grants No. 62206110, 62176103, 62377208, and 62276114; and in part by the Science and Technology Planning Project of Guangzhou under Grants 2024A04J9896 and 2025A03J3565. In particular, we thank all the reviewers for their constructive suggestions that helped to improve this work.
Impact Statement

This work advances preference-based learning for large language models through an improved exploration mechanism, which has implications for AI alignment and safer deployment of language technologies. By enhancing LLMs' reasoning and long-form generation capabilities through more principled exploration and preference learning, our approach could lead to more reliable and controllable language models. However, improved reasoning capabilities could also enable more sophisticated text generation that may be misused. The integration of MCTS with preference learning represents a step toward more transparent optimization of language model behavior, though careful consideration should be given to the selection of preference data to avoid encoding harmful biases. We believe the technical advances presented here can contribute positively to the development of more capable and aligned language models when deployed thoughtfully with appropriate safeguards and oversight.

References

Abdulhai, M., White, I., Snell, C., Sun, C., Hong, J., Zhai, Y., Xu, K., and Levine, S. LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models. arXiv preprint arXiv:2311.18232, 2023.

Amini, A., Vieira, T., and Cotterell, R. Direct preference optimization with an offset, 2024. URL https://arxiv.org/abs/2402.10571.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397-2430. PMLR, 2023.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I.
The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.

Chen, G., Liao, M., Li, C., and Fan, K. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024a.

Chen, G., Liao, M., Li, C., and Fan, K. Step-level value preference optimization for mathematical reasoning, 2024b. URL https://arxiv.org/abs/2406.10858.

Chen, Z., Wang, K., Wang, X., Peng, P., Izquierdo, E., and Lin, L. Deep co-space: Sample mining across feature transformation for semi-supervised learning. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2667-2678, 2017.

Chen, Z., Huang, X., Guan, Q., Lin, L., and Luo, W. A retrospect to multi-prompt learning across vision and language. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22190-22201, 2023.

Chen, Z., Zheng, Y., Lai, Z.-R., Guan, Q., and Lin, L. Diagnosing and rectifying fake OOD invariance: A restructured causal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 11471-11479, 2024c.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pp. 72-83. Springer, 2006.

Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. RAFT: Reward ranked finetuning for generative foundation model alignment, 2023. URL https://arxiv.org/abs/2304.06767.
Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., and Zhang, T. RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863, 2024.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep policy gradients: A case study on PPO and TRPO, 2020. URL https://arxiv.org/abs/2005.12729.

Feng, X., Wan, Z., Wen, M., McAleer, S. M., Wen, Y., Zhang, W., and Wang, J. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.

Feng, X., Wan, Z., Wen, M., McAleer, S. M., Wen, Y., Zhang, W., and Wang, J. Alphazero-like tree-search can guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179.

Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-Carlo tree search as regularized policy optimization. pp. 3769–3778, 2020a.

Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-Carlo tree search as regularized policy optimization. In International Conference on Machine Learning, pp. 3769–3778. PMLR, 2020b.

Hao, S., Gu, Y., Ma, H., et al. Reasoning with language models is planning with a world model. arXiv preprint arXiv:2305.14992, 2023.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., and Raileanu, R. Understanding the effects of RLHF on LLM generalisation and diversity. arXiv preprint arXiv:2310.06452, 2023.

Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293. Springer, 2006.

Liao, W., Chu, X., and Wang, Y.
TPO: Aligning large language models with multi-branch & multi-step preference trees. arXiv preprint arXiv:2410.12854, 2024.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.

Liu, A., Bai, H., Lu, Z., Sun, Y., Kong, X., Wang, S., Shan, J., Jose, A. M., Liu, X., Wen, L., Yu, P. S., and Cao, M. TIS-DPO: Token-level importance sampling for direct preference optimization with estimated weights, 2024a. URL https://arxiv.org/abs/2410.04350.

Liu, Z., Lu, M., Zhang, S., Liu, B., Guo, H., Yang, Y., Blanchet, J., and Wang, Z. Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer, 2024b. URL https://arxiv.org/abs/2405.16436.

Mitchell, E. A note on DPO with noisy preferences & relationship to IPO. arXiv preprint arXiv:2304.12345, 2023a.

Mitchell, E. A note on DPO with noisy preferences & relationship to IPO, 2023b.

Ouyang, Y., Wang, L., Yang, F., Zhao, P., Huang, C., Liu, J., Pang, B., Yang, Y., Zhan, Y., Sun, H., Lin, Q., Rajmohan, S., Deng, W., Zhang, D., Sun, F., and Zhang, Q. Token-level proximal policy optimization for query generation, 2024. URL https://arxiv.org/abs/2411.00722.

Pan, L., Albalak, A., Wang, X., and Wang, W. Y. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In The 2023 Conference on Empirical Methods in Natural Language Processing.

Rafailov, R., Hejna, J., Park, R., and Finn, C. Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358, 2024a.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024b.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C.
Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024c.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017b.

Tafjord, O., Mishra, B. D., and Clark, P. ProofWriter: Generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048, 2020.

Team, Q. Introducing Qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/.

Uesato, J., Kushman, N., Kumar, R., et al. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

Wang, X., Song, L., Tian, Y., Yu, D., Peng, B., Mi, H., Huang, F., and Yu, D.
Towards self-improvement of LLMs via MCTS: Leveraging stepwise knowledge with curriculum preference learning, 2024a. URL https://arxiv.org/abs/2410.06508.

Wang, X., Song, L., Tian, Y., Yu, D., Peng, B., Mi, H., Huang, F., and Yu, D. Towards self-improvement of LLMs via MCTS: Leveraging stepwise knowledge with curriculum preference learning, 2024b. URL https://arxiv.org/abs/2410.06508.

Wang, Y., Liu, Q., and Jin, C. Is RLHF more difficult than standard RL? A theoretical perspective. Advances in Neural Information Processing Systems, 36:76006–76032, 2023.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and Shieh, M. Monte Carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451, 2024a.

Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and Shieh, M. Monte Carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451, 2024b.

Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and Shieh, M. Monte Carlo tree search boosts reasoning via iterative preference learning, 2024c. URL https://arxiv.org/abs/2405.00451.

Yao, S., Yu, D., Zhao, J., et al. Tree of thoughts: Deliberate problem-solving with large language models. arXiv preprint arXiv:2305.10601, 2023.

Yuan, W., Pang, R. Y., Cho, K., Li, X., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models, 2024. URL https://arxiv.org/abs/2401.10020.

Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., and Huang, F. RRHF: Rank responses to align language models with human feedback without tears, 2023. URL https://arxiv.org/abs/2304.05302.

Zeng, Y., Liu, G., Ma, W., Yang, N., Zhang, H., and Wang, J.
Token-level direct preference optimization. arXiv preprint arXiv:2404.11999, 2024a.

Zeng, Y., Liu, G., Ma, W., Yang, N., Zhang, H., and Wang, J. Token-level direct preference optimization, 2024b. URL https://arxiv.org/abs/2404.11999.

Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., and Tang, J. ReST-MCTS*: LLM self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816, 2024a.

Zhang, X., Du, C., Pang, T., Liu, Q., Gao, W., and Lin, M. Chain of preference optimization: Improving chain-of-thought reasoning in LLMs, 2024b. URL https://arxiv.org/abs/2406.09136.

Zhong, H., Feng, G., Xiong, W., Cheng, X., Zhao, L., He, D., Bian, J., and Wang, L. DPO meets PPO: Reinforced token optimization for RLHF. arXiv preprint arXiv:2404.18922, 2024a.

Zhong, H., Feng, G., Xiong, W., Cheng, X., Zhao, L., He, D., Bian, J., and Wang, L. DPO meets PPO: Reinforced token optimization for RLHF, 2024b. URL https://arxiv.org/abs/2404.18922.

Zhou, W., Zhang, S., Zhao, L., and Meng, T. T-REG: Preference optimization with token-level reward regularization, 2024. URL https://arxiv.org/abs/2412.02685.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

A. Related Work

The integration of search-based optimization techniques in language model alignment has significantly advanced AI alignment, with Reinforcement Learning from Human Feedback (RLHF) (Bai et al., 2022; Wang et al., 2023; Kirk et al., 2023; Dong et al., 2024) being a widely used approach. RLHF employs reward models trained from human feedback to optimize model behavior through reinforcement learning, typically using Proximal Policy Optimization (PPO) (Zhong et al., 2024b). However, RLHF has been criticized for instability, sample inefficiency, and over-optimization issues (Engstrom et al., 2020; Liu et al., 2024b; Chen et al., 2023; 2017; 2024c).
To address these challenges, methods like Reward Ranked Finetuning (RAFT) (Dong et al., 2023) and Rank Responses to align Human Feedback (RRHF) (Yuan et al., 2023) have been proposed to refine ranking-based optimization without explicit reinforcement learning. More recently, Direct Preference Optimization (DPO) (Rafailov et al., 2024c; Amini et al., 2024) has emerged as an alternative that aligns language models directly from human preference data without requiring a reward model. Unlike PPO, which operates within a reinforcement learning framework by optimizing policies through reward feedback, DPO reformulates the alignment problem as a supervised learning task, making policy updates more stable while maintaining alignment with human preferences. Further refinements that process responses at the token level, such as TDPO (Zeng et al., 2024b), T-REG (Zhou et al., 2024), TPPO (Ouyang et al., 2024), and TIS-DPO (Liu et al., 2024a), improve efficiency by incorporating token-wise adjustments, sharpening preference alignment in a manner that contrasts with RLHF's reliance on policy-gradient updates.

Monte Carlo Tree Search (MCTS) has been extensively applied in decision-making tasks, particularly in game-playing AI, as demonstrated by its success in AlphaGo (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2020). Recent advancements have expanded its application to large language models (LLMs) for structured text generation, such as AlphaZero-like Tree-Search for LLMs (TS-LLM) (Feng et al., 2023). Similarly, Xie et al. (2024c) propose an iterative preference learning approach, using MCTS to refine step-wise reasoning capabilities in LLMs. Further developments by Wang et al. (2024b) introduce self-improvement techniques where LLMs leverage MCTS for preference-guided reinforcement learning, while Chen et al.
(Chen et al., 2024b) focus on step-level value preference optimization, allowing fine-grained preference learning at intermediate steps. Zhang et al. (2024b) extend this concept with chain preference optimization, improving long-range decision-making for complex reasoning tasks. Lastly, Liao et al. (2024) introduce Tree-based Preference Optimization (TPO), which integrates MCTS with preference alignment techniques to refine LLM outputs progressively. Despite these advancements, a key challenge remains: these approaches still require learning an explicit reward or value function, which is critical for optimizing the search process and improving the efficiency of LLM training and inference.

Our work proposes Implicit Tree Search (ITS), which integrates stochastic policy optimization with AlphaZero-like search principles while eliminating explicit tree structures. ITS leverages reverse KL-divergence constraints and a stochastic sampling policy to enhance exploration expressiveness, addressing the cold-start issues common in MCTS-based methods. Similar ideas have been explored in Monte Carlo-based regularized policy optimization (Grill et al., 2020a); Wang et al. (2024a) use a pairwise training framework that enables LLMs to self-improve through MCTS behavior, whereas ITS extends these principles to preference learning in language models. Furthermore, our approach aligns with broader research on AI safety and preference-based learning (Yuan et al., 2024; Xie et al., 2024a; Chen et al., 2024a; Mitchell, 2023a). ITS represents a novel search-and-learn paradigm that improves structured reasoning in language models without relying on explicit tree expansion, bridging the gap between MCTS and modern alignment techniques.

C.1. Fundamentals of Monte Carlo Tree-Search Methods

Monte Carlo Tree Search (MCTS) has been widely adopted as an effective strategy for solving problems requiring sequential decision-making and planning.
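The selection and backup steps of such a search can be sketched as follows. This is a minimal illustration of PUCT-style action selection and value propagation as used in AlphaZero-like MCTS; the dictionary-based node layout and function names are hypothetical, not the paper's implementation:

```python
import math

def puct_select(children, c_puct=1.25):
    """Select the action maximizing Q(s, a) + U(s, a) under the PUCT rule.

    `children` maps each action to its statistics: 'N' (visit count),
    'W' (cumulative value), and 'prior' (policy probability pi(a|s)).
    """
    total_visits = sum(ch['N'] for ch in children.values())
    best_action, best_score = None, -float('inf')
    for action, ch in children.items():
        q = ch['W'] / ch['N'] if ch['N'] > 0 else 0.0   # mean action value Q
        u = c_puct * ch['prior'] * math.sqrt(total_visits) / (1 + ch['N'])
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action

def backup(path, leaf_value):
    """Propagate a leaf evaluation back along the visited (children, action) path."""
    for children, action in path:
        children[action]['N'] += 1      # visit-count update
        children[action]['W'] += leaf_value  # cumulative-value update
```

Note how an unvisited action with a large prior is selected before a moderately valued but heavily visited one: this is exactly the exploration/exploitation trade-off that the PUCT bonus encodes.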
Traditional MCTS operations, as introduced by (Kocsis & Szepesvári, 2006) and (Coulom, 2006), include four key steps: selection, expansion, simulation, and backpropagation. These operations can be adapted for more advanced frameworks like AlphaZero (Silver et al., 2017b), which incorporates a learned value function and policy network to guide the search process. To address challenges in balancing exploration and exploitation, we utilize a variant of MCTS with a modified Predictor Upper Confidence Tree (PUCT) algorithm (Rosin, 2011). The algorithm selects the action $a_t$ at each node $s_t$ as follows:

$$a_t = \arg\max_a \left[ Q(s_t, a) + U(s_t, a) \right],$$

where $U(s_t, a)$ is calculated using the formula:

$$U(s, a) = c_{\text{puct}} \, \pi_\theta(a|s) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}.$$

Here, $N(s, a)$ represents the visit count of action $a$ at node $s$, and $c_{\text{puct}}$ is a constant controlling exploration, defined by:

$$c_{\text{puct}} = \log \frac{\sum_b N(s, b) + c_{\text{base}} + 1}{c_{\text{base}}} + c_{\text{init}}.$$

Node Expansion and Assessment: Upon reaching a leaf node $s_L$, if it is not terminal, the tree is expanded by generating possible successor nodes. The value of the leaf node is then estimated using a neural network. For terminal nodes, a reward function $R(s_L)$ is used, or an Outcome Reward Model (ORM) serves as an approximation (Uesato et al., 2022).

Value Propagation: Once a leaf node is assessed, the computed values are propagated back through the path $s_0, s_1, \ldots, s_L$. For each node, the visit count is updated as $N(s_t, a) = N(s_t, a) + 1$, and the cumulative action value as $W(s_t, a) = W(s_t, a) + v(s_L)$. The mean action value is then computed as:

$$Q(s_t, a) = \frac{W(s_t, a)}{N(s_t, a)}.$$

This combination of learned value functions and search-based methods enables efficient exploration of large decision spaces, as demonstrated in AlphaZero (Silver et al., 2017b) and its applications to tree-search-guided language models (Yao et al., 2023; Hao et al., 2023).

B.1. Proof of Proposition 3.1

Proof.
Given $r \in \mathbb{R}$, with $n(s, a')$ fixed for all $a' \in \mathcal{A}_{\text{MCTS}} \setminus \{a\}$, the probability that the empirical visit distribution assigns to action $a$ is

$$\frac{1 + n(s, a)^r}{|\mathcal{A}_{\text{MCTS}}| + \sum_{a' \in \mathcal{A}_{\text{MCTS}}} n(s, a')^r}.$$

It holds that

$$\frac{1 + n(s, a)^r}{|\mathcal{A}_{\text{MCTS}}| + \sum_{a' \in \mathcal{A}_{\text{MCTS}}} n(s, a')^r} = 1 - \frac{|\mathcal{A}_{\text{MCTS}}| - 1 + \sum_{a' \in \mathcal{A}_{\text{MCTS}} \setminus \{a\}} n(s, a')^r}{|\mathcal{A}_{\text{MCTS}}| + \sum_{a' \in \mathcal{A}_{\text{MCTS}} \setminus \{a\}} n(s, a')^r + n(s, a)^r}. \quad (22)$$

Since $|\mathcal{A}_{\text{MCTS}}| - 1 + \sum_{a' \in \mathcal{A}_{\text{MCTS}} \setminus \{a\}} n(s, a')^r$ is constant, the exponential empirical visit distribution only changes along with $n(s, a)^r$. Since $|\mathcal{A}_{\text{MCTS}}| > 1$ by the tree definition, $|\mathcal{A}_{\text{MCTS}}| + \sum_{a' \in \mathcal{A}_{\text{MCTS}} \setminus \{a\}} n(s, a')^r > 1$. Note that $f(x) = 1 - \frac{c}{c + x}$ is bijective in $x$ for fixed $c > 1$, and when $r \neq 0$, $\{n^r : n \in \mathbb{Z}_{>0}\}$ has the same cardinality as $\mathbb{Z}_{>0}$. The proposition has been proved.

B.2. Proof of Proposition 3.2

Proof. To prove the proposition, we only need to prove the following lemma:

Lemma B.1. Given positive integers $a$, $b$, and $\{c_i\}_{i=1}^a$, define

$$f(r) = \frac{1 + b^r}{a + 1 + \sum_{i=1}^a c_i^r + b^r}.$$

We claim that for all real $r$, the value of $f(r)$ lies strictly within the open interval $\left(\frac{1}{a+1}, 1\right)$.

Proof. 1. As $r \to -\infty$: since $b^r \to 0$ and $c_i^r \to 0$ for each $i$, we have

$$\lim_{r \to -\infty} f(r) = \frac{1 + 0}{a + 1 + 0 + 0} = \frac{1}{a+1}.$$

Thus $f(r)$ never goes below $1/(a+1)$.

2. As $r \to +\infty$: let $M = \max\{b, c_1, c_2, \ldots, c_a\}$. For sufficiently large $r$, $M^r$ dominates $b^r$ and each $c_i^r$, so the numerator and denominator of $f(r)$ are asymptotically proportional to $M^r$, giving $\lim_{r \to +\infty} f(r) = 1$.

Since $b^r$ and $c_i^r$ increase with $r$, one can show $f(r)$ itself is strictly increasing in $r$. Consequently, its image over $r \in \mathbb{R}$ is precisely $\left(\frac{1}{a+1}, 1\right)$.

Counterexample. Because $f(r)$ is bounded below by $\frac{1}{a+1}$, any real $x$ with $0 < x < \frac{1}{a+1}$ cannot be realized by $f(r)$. For instance, choose $x_0 = \frac{1}{2(a+1)}$. Then $0 < \frac{1}{2(a+1)} < \frac{1}{a+1}$, but there is no real $r$ for which $f(r) = x_0$. Hence, values in $\left(0, \frac{1}{a+1}\right]$ are not attainable.

Setting $a = |\mathcal{A}_{\text{MCTS}}| - 1$, $b = n(s, a)$, and $c_i = n(s, a_i)$ for $a_i \in \mathcal{A}_{\text{MCTS}} \setminus \{a\}$, the proposition has been proved.

B.3. Proof of Lemma 4.1

Proof.
As discussed in Lemma 3.4,

$$\pi_\phi(a_t|s_t) = \frac{\lambda_{N(s_t)} \, \pi_\theta(a_t|s_t)}{\alpha(s_t) - Q^{\pi_\theta}(s_t, a_t)} \;\Longrightarrow\; r(s_t, a_t) = \alpha(s_t) - V^{\pi_\theta}(s_t) - \lambda_{N(s_t)} \frac{\pi_\theta(a_t|s_t)}{\pi_\phi(a_t|s_t)}, \quad (23)$$

where $Q^{\pi_\theta}(s_t, a_t) = r(s_t, a_t) + V^{\pi_\theta}(s_t)$. Given $(x, y^{(w)})$ denoting a pair of prompt and its preferred response, we construct the preferred state-action trajectory along the token-level MDP, i.e., $\{(s^{(w)}_t, a^{(w)}_t)\}_{t=1}^{T_w}$ w.r.t. $s^{(w)}_t = (x, y^{(w)}_{0:t-1})$. For all reward classes consistent with the Plackett-Luce (and Bradley-Terry) models, the step-wise reward $r(s_t, a_t)$ can be represented with the re-parameterization

$$r(s_t, a_t) = \beta \log \pi(a_t|s_t) - \beta \log \pi_{\text{ref}}(a_t|s_t) \quad (27)$$

within the token MDP, where $V(s_T) = 0$ for all terminal states. Here we combine Lemma 4.1 and Lemma B.2 to certify our theorem. Specifically, $V^{\pi_\theta}(s_T) = V(s_T) = 0$ for all terminal states given any sequence, since $\theta$ is optimized by DPO-like algorithms. By Lemma B.2, we have the dense reward decomposition

$$r(s^{(w)}_t, \hat{a}_t) = \beta \log \pi_\theta(\hat{a}_t|s^{(w)}_t) - \beta \log \pi_{\text{ref}}(\hat{a}_t|s^{(w)}_t) = \beta \log \frac{\pi_\theta(\hat{a}_t|s^{(w)}_t)}{\pi_{\text{ref}}(\hat{a}_t|s^{(w)}_t)}, \quad t \in [T_w]. \quad (28)$$

The decomposition also holds in the dispreferred states, such that

$$r(s^{(l)}_t, \check{a}_t) = \beta \log \pi_\theta(\check{a}_t|s^{(l)}_t) - \beta \log \pi_{\text{ref}}(\check{a}_t|s^{(l)}_t) = \beta \log \frac{\pi_\theta(\check{a}_t|s^{(l)}_t)}{\pi_{\text{ref}}(\check{a}_t|s^{(l)}_t)}, \quad t \in [T_l], \quad (29)$$

where the dispreferred state-action trajectory along the token-level MDP, i.e., $\{(s^{(l)}_t, a^{(l)}_t)\}_{t=1}^{T_l}$, is constructed by $s^{(l)}_t = (x, y^{(l)}_{0:t-1})$.

Algorithm 1 (fragment). Build the policy augmentation pool $\mathcal{D}^+_{ss} = \{(x, y_i, y_j) \mid i \neq j \in [K],\; U_{ss}(x, y_i, y_j) > 0,\; (x, y_i, y_j) \in \mathrm{topK}(\{U_{ss}(x, y_i, y_j)\}_{i \neq j \in [K]})\}$, or build the sentence-level policy augmentation pool $\mathcal{D}^+_{as} = \{(x, y_i, y_j) \mid i \neq j \in [K],\; U_{as}(x, y_i, y_j) > 0,\; (x, y_i, y_j) \in \mathrm{topK}(\{U_{as}(x, y_i, y_j)\}_{i \neq j \in [K]})\}$. Minimize $\mathcal{L}_{\text{cDPO}}(\theta)$ to post-train the LLM $\theta$ with the prompt and its pairwise responses drawn from $\mathcal{D} \cup \mathcal{D}^+_{ss}$ or $\mathcal{D} \cup \mathcal{D}^+_{as}$; $N_{\text{alter}} \leftarrow N_{\text{alter}} + 1$; until $N_{\text{alter}} = n_{\text{alter}}$; $\theta^* \leftarrow \theta$, $\phi^* \leftarrow \phi$.

C. Implementation

The algorithm pipeline of IT-PO is presented in Algorithm 1.
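As a concrete reading of the dense reward decomposition above, the per-token reward is simply a scaled log-ratio of the trained and reference policies. A minimal sketch in plain Python; the per-token log-probability inputs are hypothetical placeholders, not the paper's pipeline:

```python
def implicit_token_rewards(logp_policy, logp_ref, beta=0.1):
    """Dense rewards r_t = beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)).

    `logp_policy` and `logp_ref` hold per-token log-probabilities of the
    same response under the trained policy and the frozen reference model.
    """
    return [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]

def sequence_reward(logp_policy, logp_ref, beta=0.1):
    """Sequence-level implicit reward: the sum of the dense token rewards."""
    return sum(implicit_token_rewards(logp_policy, logp_ref, beta))
```

Tokens where the policy and reference agree contribute zero reward; only tokens whose likelihood the policy has shifted away from the reference carry signal.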
The implementation is almost consistent with its theoretical foundation, except for two details: (1) the specification of $\lambda_{N(s_t)}$; (2) the gradient implementation for $\phi$ in $\mathcal{L}_{ss\text{-IT-PO}}(\phi; \theta)$ and $\mathcal{L}_{as\text{-IT-PO}}(\phi; \theta)$.

$\lambda_{N(s^{(w)}_t)}$ and $\lambda_{N(s^{(l)}_t)}$ denote the number of reasoning paths generated from the state $s_t$, which consists of a prompt $x$ and its response $y^{(w)}$ or $y^{(l)}$ with their contents in the previous $t-1$ steps. For each preference pair, we set $\lambda_{N(s^{(w)}_t)} = K_w$ for all $t \in \{1, \ldots, T_w\}$ and $\lambda_{N(s^{(l)}_t)} = K_l$ for all $t \in \{1, \ldots, T_l\}$, where $K_w, K_l$ denote how many searches start from the $t$-th leaf nodes $s^{(w)}_t, s^{(l)}_t$, respectively. Since $T_w, T_l$ change across preference pairs, IT-PO adaptively configures $K_w = \frac{K}{T_w}$ and $K_l = \frac{K}{T_l}$ to balance the optimization over the different response lengths of $y^{(w)}, y^{(l)}$ in each pair. We set $K = 8$, inspired by the number of sampled responses per prompt in many RLHF implementations.

In the gradient analysis of $\phi$ in $\mathcal{L}_{ss\text{-IT-PO}}(\phi; \theta)$, the gradient of $\phi$ consists of terms of the form $\frac{\pi_\theta(a_t|s_t)}{\pi_\phi(a_t|s_t)} \nabla \log \pi_\phi(a_t|s_t)$. Since $\frac{\pi_\theta(a_t|s_t)}{\pi_\phi(a_t|s_t)} \in (0, +\infty)$, updating the models with such terms may suffer from exploding or vanishing gradients. To this end, we use the logarithmic scaling $\frac{\pi_\theta(a_t|s_t)}{\pi_\phi(a_t|s_t)} = \exp\left(\log \pi_\theta(a_t|s_t) - \log \pi_\phi(a_t|s_t)\right)$ to ensure a less sensitive

Table 5. Task setups. The node, tree max width, and tree max depth are search space parameters. The max tree-width and tree-depth follow the empirical experience in (Feng et al., 2023).

Task   | Task Category          | Train/test size | Node     | Tree Max width | Tree Max depth
GSM8k  | Mathematical Reasoning | 7.5k / 1.3k     | Sentence | 6              | 8
Game24 | Mathematical Planning  | 1.0k / 0.3k     | Sentence | 20             | 4
MATH   | Mathematical Reasoning | - / -           | Sentence | 6              | 8

Table 6.
Performance comparison on ProofWriter and Chess Endgame tasks.

Setting     | Baseline           | ProofWriter (Acc %) | Chess Endgame (Win rate %)
Path@1      | CoT-greedy         | 37.72               | 58.14
Path@1      | BFS-V              | 48.94               | 67.75
Path@1      | MCTS-α             | 66.71               | 96.90
Path@1      | MCTS-rollout       | 69.23               | 98.76
Path@1      | ITS-α (ours)       | 71.77               | 99.21
Path@1      | ITS-rollout (ours) | 75.31               | 99.83
Equal-Token | CoT-SC-MAJ         | 36.50               | 9.84
Equal-Token | CoT-SC-ORM         | 36.58               | 73.80
Equal-Token | BFS-V-ORM          | 63.42               | 93.18
Equal-Token | MCTS-ORM           | 60.86               | 94.26
Equal-Token | ITS-α (ours)       | 74.26               | 96.48
Equal-Token | ITS-rollout (ours) | 78.15               | 98.57

update ratio. In the gradient analysis of $\phi$ in $\mathcal{L}_{as\text{-IT-PO}}(\phi; \theta)$, we can derive the similar gradients

$$\nabla_\phi \mu^w_S(\phi, \theta) = \sum_{t=1}^{T_w} \lambda_{N(s^{(w)}_t)} \left[ \prod_{i=1}^{|A^{(w)}_t|} \frac{\pi_\theta(a^{(w,i)}_t|s^{(w,i)}_t)}{\pi_\phi(a^{(w,i)}_t|s^{(w,i)}_t)} \sum_{j=1}^{|A^{(w)}_t|} \nabla \log \pi_\phi(a^{(w,j)}_t|s^{(w,j)}_t) - \prod_{i} \frac{\pi_\theta(\hat{a}^{(i)}_t|\hat{s}^{(i)}_t)}{\pi_\phi(\hat{a}^{(i)}_t|\hat{s}^{(i)}_t)} \sum_{j} \nabla \log \pi_\phi(\hat{a}^{(j)}_t|\hat{s}^{(j)}_t) \right];$$

$$\nabla_\phi \mu^l_S(\phi, \theta) = \sum_{t=1}^{T_l} \lambda_{N(s^{(l)}_t)} \left[ \prod_{i=1}^{|A^{(l)}_t|} \frac{\pi_\theta(a^{(l,i)}_t|s^{(l,i)}_t)}{\pi_\phi(a^{(l,i)}_t|s^{(l,i)}_t)} \sum_{j=1}^{|A^{(l)}_t|} \nabla \log \pi_\phi(a^{(l,j)}_t|s^{(l,j)}_t) - \prod_{i} \frac{\pi_\theta(\check{a}^{(i)}_t|\check{s}^{(i)}_t)}{\pi_\phi(\check{a}^{(i)}_t|\check{s}^{(i)}_t)} \sum_{j} \nabla \log \pi_\phi(\check{a}^{(j)}_t|\check{s}^{(j)}_t) \right],$$

where we also apply the logarithmic scaling to the gradients in their implementation, e.g.,

$$\prod_{i=1}^{|A^{(w)}_t|} \frac{\pi_\theta(a^{(w,i)}_t|s^{(w,i)}_t)}{\pi_\phi(a^{(w,i)}_t|s^{(w,i)}_t)} = \prod_{i=1}^{|A^{(w)}_t|} \exp\left( \log \pi_\theta(a^{(w,i)}_t|s^{(w,i)}_t) - \log \pi_\phi(a^{(w,i)}_t|s^{(w,i)}_t) \right), \quad (44)$$

and similarly for the others.

To train our policy networks on GSM8K, Game24, and MATH, we construct preference data reconfigured from their training sets. For each question $x$ with correct response $y$, we compute $\log \pi_{\text{ref}}(y'|x)$ across all responses $y'$ from another question to generate hard negative preference pairs, then take them to initialize $\pi_\theta$ via cDPO. We set $\epsilon = 0$ for the human alignment task, and $\epsilon = 0.25$ for the mathematical reasoning and planning tasks. As for tree-based decoding, we employ the same tree-width pruning strategy as (Feng et al., 2024); the only difference arises from deterministic versus stochastic decision making.
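The logarithmic scaling in Eq. (44) amounts to accumulating the ratio product in log space and exponentiating once. A small sketch of this design choice; the function and argument names are illustrative, assuming per-action log-probabilities are available:

```python
import math

def ratio_product_logspace(logp_theta, logp_phi):
    """Compute prod_i pi_theta(a_i|s_i) / pi_phi(a_i|s_i) stably.

    Summing log differences and exponentiating once avoids the overflow or
    underflow that multiplying many raw ratios in (0, +inf) can cause.
    """
    return math.exp(sum(lt - lp for lt, lp in zip(logp_theta, logp_phi)))
```

Mathematically this equals the naive product, but the intermediate values stay bounded, which is the "less sensitive update ratio" motivation stated above.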
In GSM8K, we also provide a component analysis by varying $\epsilon$ over the ratios {100%, 75%, 50%, 25%, 0%}, yielding results of 53.2, 52.9, 53.4, 51.8, and 50.1, respectively. The performance drops drastically when the ratio is less than 50%.

D. LLM-based Reasoning Experiments Beyond Mathematics

We offer evaluation based on ProofWriter (Tafjord et al., 2020) for deductive logical reasoning, and Chess Endgame (Abdulhai et al., 2023) for long-term multi-turn decision making. For ProofWriter, we follow (Pan et al.) to generate the test set; the remaining instances are merged into 41,433 training instances. All training and test instances employ the prompt template in (Pan et al.) that initiates the start of CoT; we then employ LLaMA2-7B as the base model, and all fine-tuning methods run for only a single epoch. For Chess Endgame, we follow the experimental setup in (Feng et al., 2023). For each prompt-response pair $(x, y^{(w)})$ in ProofWriter and Chess Endgame, we find the dispreferred response $y^{(l)}$ using the same strategy as in our mathematical reasoning tasks. We ensure a fair comparison with the CoT and LLM-based tree-search baselines, CoT-greedy, BFS-V, MCTS-α, MCTS-rollout, CoT-SC-MAJ, CoT-SC-ORM, BFS-V-ORM, and MCTS-ORM, whose implementations are consistent with those in the paper. For simplicity, we skip the average-token-number metric to highlight Acc on ProofWriter and Win rate on Chess Endgame. The results are reported under Path@1, which promises computational efficiency, and Equal-Token, which encourages comparison across baselines at a similar scale of computational consumption. In the table, we find that CoT variants almost fail on ProofWriter, with performances close to random guessing (33.33%). MCTS variants obtain significantly better results yet generally under-perform ITS variants by a substantial gap in Acc, probably due to the cold-start effect when MCTS is learned with one epoch.
As for Chess Endgame, ITS variants almost solve the task, with Win rates of 99.83% in Path@1 and 98.57% in Equal-Token. This shows that ITS is also competitive in long-horizon reasoning.