# Trust-Region Twisted Policy Improvement

Joery A. de Vries¹, Jinke He¹, Yaniv Oren¹, Matthijs T. J. Spaan¹

**Abstract.** Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.

## 1. Introduction

Monte-Carlo tree search (MCTS) with neural networks (Browne et al., 2012) has enabled many recent successes in sequential decision-making problems such as board games (Silver et al., 2018), Atari (Ye et al., 2021), and algorithm discovery (Fawzi et al., 2022; Mankowitz et al., 2023). These successes demonstrate that combining decision-time planning and reinforcement learning (RL) often outperforms methods that utilize search or deep neural networks in isolation. Naturally, this has stimulated many studies to understand the role of planning in combination with learning (Guez et al., 2019; de Vries et al., 2021; Hamrick et al., 2021; Bertsekas, 2022; He et al., 2024), along with studies to improve specific aspects of the base algorithm (Hubert et al., 2021; Danihelka et al., 2022; Antonoglou et al., 2022; Oren et al., 2025).

¹Delft University of Technology, Delft, the Netherlands. Correspondence to: Joery A. de Vries.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Although MCTS has been a leading technology in recent breakthroughs, the tree search is inherently sequential, can deteriorate agent performance at small planning budgets (Grill et al., 2020), and requires significant modifications for general use in RL (Hubert et al., 2021). The sequential nature of MCTS is particularly crippling, as it limits the full utilization of modern hardware such as GPUs. Despite follow-up work attempting to address specific issues, alternative planners have since been explored that avoid these flaws. Successful alternatives in this area are often inspired by stochastic control (Del Moral, 2004; Åström, 2006); examples include path integral control (Theodorou et al., 2010; Williams et al., 2015; Hansen et al., 2022) and related sequential Monte-Carlo (SMC) methods (Naesseth et al., 2019; Chopin & Papaspiliopoulos, 2020). Specifically, recent variational SMC planners (Naesseth et al., 2018; Macfarlane et al., 2024) have shown great potential in terms of generality, performance, and scalability to parallel compute. These methods adopt a particle filter for trajectory smoothing to enable planning in RL (Piché et al., 2019). However, the distribution of interest for these particle filters does not perfectly align with learning and exploration for RL agents. Namely, recent SMC methods focus on estimating the trajectory distribution under an unknown policy, and not the actual unknown policy at the state where we initiate the planner. We find that this mismatch can cause SMC planners to suffer from unnecessarily high-variance estimation and waste much of their compute and data during planning.
In other words, online planning in RL should serve as a local approximate policy improvement (Chan et al., 2022; Sutton & Barto, 2018). Fortunately, the existing MCTS and SMC literature provides various directions to achieve this (Moral et al., 2010; Svensson et al., 2015; Lawson et al., 2018; Danihelka et al., 2022; Grill et al., 2020), but their use in SMC planning remains largely unrealized.

This paper aims to address the current limitations of bootstrapped particle filter planners for RL by drawing inspiration from MCTS. Our contributions are 1) making more accurate estimates of the statistics extracted by the planner at the start of planning, and 2) enhancing data-efficiency inside the planner. We address the pervasive path-degeneracy problem in SMC by backing up accumulated reward and value data to perform policy inference and construct value targets. Next, to reduce the variance of estimated statistics due to resampling, we use exponential twisting functions to improve the sampling of trajectories inside the planner. We also impose adaptive trust-region constraints relative to a prior policy to control the bias-variance trade-off in sampling proposal trajectories. Finally, we modify the resampling method (Naesseth et al., 2019) to correct particles that become permanently stuck in absorbing states due to termination in the baseline SMC. We dub our new method Trust-Region Twisted SMC and demonstrate improved sample-efficiency and runtime scaling over SMC and MCTS baselines in discrete and continuous domains.

## 2. Background

We want to find an optimal policy $\pi^*$ for a sequential decision-making problem, which we formalize as an infinite-horizon Markov decision process (MDP) (Sutton & Barto, 2018).
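The terminal-state handling sketched in the contributions above can be illustrated with a toy resampling step. The snippet below is not the paper's implementation; it is a minimal sketch (with hypothetical names) of one plausible way to keep particles that sit in absorbing states from being cloned during systematic resampling.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: one shared uniform offset per sweep."""
    n = len(weights)
    cum = np.cumsum(weights)
    cum[-1] = 1.0  # guard against floating-point round-off
    positions = (rng.uniform() + np.arange(n)) / n
    return np.searchsorted(cum, positions)

def smc_resample_step(log_weights, terminated, rng):
    """One resampling step that assigns zero weight to terminated
    particles so they are never duplicated; if every particle has
    terminated, the particle set is returned unchanged."""
    w = np.exp(log_weights - log_weights.max())
    w = np.where(terminated, 0.0, w)      # mask absorbed particles
    if w.sum() == 0.0:                    # all terminated: keep as-is
        return np.arange(len(w))
    return systematic_resample(w / w.sum(), rng)
```

Zeroing the weight of absorbed particles before normalization means the cumulative-sum search can never land on them, which is one simple way to avoid wasting particle budget on states that produce no further rewards.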
We define states $S \in \mathcal{S}$, actions $A \in \mathcal{A}$, and rewards $R \in \mathbb{R}$ as random variables, where we write $H_{1:T} = \{S_t, A_t\}_{t=1}^{T}$ as the joint random variable of a finite sequence,

$$p_\pi(H_{1:T}) = \prod_{t=1}^{T} \pi(A_t \mid S_t)\, p(S_t \mid S_{t-1}, A_{t-1}), \tag{1}$$

where $p(S_1 \mid A_0, S_0) = p(S_1)$ is the initial state distribution, $p(S_{t+1} \mid S_t, A_t)$ is the transition model, and $\pi(A_t \mid S_t)$ is the policy. We denote the set of admissible policies as $\Pi = \{\pi \mid \pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})\}$; our aim is to find a parametric $\pi_\theta \in \Pi$ (e.g., a neural network) such that $\mathbb{E}_{p_{\pi_\theta}(H_{1:T})}\big[\sum_{t=1}^{T} R_t\big]$ is maximized, where we abbreviate $R_t = R(S_t, A_t)$ for the rewards. For convenience, we subsume the discount factor $\gamma \in [0, 1]$ into the transition probabilities as a termination probability $p_{\text{term}} = 1 - \gamma$, assuming that the MDP always ends up in an absorbing state $S_T$ with zero rewards.

### 2.1. Control as Inference

The reinforcement learning problem can be recast as a probabilistic inference problem through the control-as-inference framework (Levine, 2018). This reformulation has led to successful algorithms like MPO (Abdolmaleki et al., 2018) that naturally allow regularized policy iteration (Geist et al., 2019), which is highly effective in practice with neural network approximation. Additionally, it enables us to directly use tools from Bayesian inference on our graphical model. To formalize this, the distribution for $H_{1:T}$ can be conditioned on a binary outcome variable $O_t \in \{0, 1\}$; then, given a likelihood $p(O_{1:T} = 1 \mid H_{1:T}) = p(O_{1:T} \mid H_{1:T})$, this gives rise to a posterior distribution $p(H_{1:T} \mid O_{1:T})$. Typically, we use the exponentiated sum of rewards for the (unnormalized) likelihood $p(O_{1:T} \mid H_{1:T}) \propto \prod_{t=1}^{T} \exp R_t$.

**Definition 2.1.** The posterior factorizes as

$$p_\pi(H_{1:T} \mid O_{1:T}) = \prod_{t=1}^{T} p_\pi(A_t \mid S_t, O_{t:T})\, p(S_t \mid S_{t-1}, A_{t-1}),$$

assuming that $S_{t+1} \perp O_{1:T} \mid S_t, A_t$ and $A_t \perp O_{1:t-1} \mid S_t$.

For the derivations in Appendix A, we additionally assume absolute continuity of the proposal, i.e., $p_\pi(H) > 0 \Rightarrow p_q(H) > 0$ almost everywhere.
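The exponentiated-reward likelihood above already suggests a Monte-Carlo scheme: trajectories sampled from the prior $p_\pi(H)$ can be reweighted by their exponentiated return to target the posterior $p(H_{1:T} \mid O_{1:T})$. The sketch below (illustrative, not the paper's planner) shows this self-normalized importance weighting.

```python
import numpy as np

def posterior_trajectory_weights(returns):
    """Self-normalized weights for p(H | O) ∝ p_pi(H) exp(sum_t R_t),
    given the returns of trajectories sampled from the prior p_pi(H).
    Subtracting the max is the usual log-sum-exp stabilization."""
    log_w = np.asarray(returns, dtype=float)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```

Trajectories with higher return receive exponentially more posterior mass; this concentration on a few trajectories is precisely why naive trajectory-level weighting is high-variance and why SMC methods resample and twist the proposal instead.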
The log-likelihood of the outcome variable decomposes as

$$
\begin{aligned}
\ln p_\pi(O) &= \ln \frac{p_\pi(O, H)}{p_\pi(H \mid O)} = \ln \frac{p_\pi(O, H)}{p_q(H)} + \ln \frac{p_q(H)}{p_\pi(H \mid O)} \\
&= \mathbb{E}_{p_q(H)}\left[\ln \frac{p_\pi(O, H)}{p_q(H)}\right] + \mathbb{E}_{p_q(H)}\left[\ln \frac{p_q(H)}{p_\pi(H \mid O)}\right] \\
&= \mathbb{E}_{p_q(H)}\left[\ln p(O \mid H) + \ln \frac{p_\pi(H)}{p_q(H)}\right] + \mathrm{KL}(p_q(H) \,\|\, p_\pi(H \mid O)) \\
&= \mathbb{E}_{p_q(H)}\left[\ln p(O \mid H)\right] - \mathrm{KL}(p_q(H) \,\|\, p_\pi(H)) + \mathrm{KL}(p_q(H) \,\|\, p_\pi(H \mid O)) \\
&= \mathbb{E}_{p_q(H)}\left[\sum_t R_t - \mathrm{KL}(q(\cdot \mid S_t) \,\|\, \pi(\cdot \mid S_t))\right] + \mathrm{KL}(p_q(H) \,\|\, p_\pi(H \mid O)),
\end{aligned}
$$

where in the last step the transition terms of $p_\pi(H)$ and $p_q(H)$ cancel out. See the work by Levine (2018) for comparison.

The result from Lemma A.1 shows that the log-likelihood decomposes into an evidence lower bound and an evidence gap. This inequality becomes tight when $q$ is simply set to the posterior policy, $p_{q^*}(H) = p_\pi(H \mid O)$, thus motivating the maximization of the lower bound as given in Theorem 2.2, or equivalently, the minimization of the evidence gap as considered by Levine (2018).

### A.2. Regularized Policy Improvement

The result below gives a brief sketch that the expectation-maximization loop (Neal & Hinton, 1998) generates a sequence of regularized Markov decision processes (MDPs) that eventually converges to a locally optimal policy. This result is a nice consequence of the control-as-inference framework. A more general discussion outside of the expectation-maximization framework can be found in the work by Geist et al. (2019).

**Lemma A.2** (Regularized Policy Improvement). The solution $q^*$ to the problem

$$\max_{q \in \Pi} \; \mathbb{E}_{p_q(H)}\left[\sum_{t=1}^{T} R_t - \mathrm{KL}(q(\cdot \mid S_t) \,\|\, \pi(\cdot \mid S_t))\right]$$

guarantees a policy improvement in the unregularized MDP,

$$\mathbb{E}_{p_{q^*}(H)}\Big[\sum_t R_t\Big] \geq \mathbb{E}_{p_\pi(H)}\Big[\sum_t R_t\Big].$$

*Proof.* Lemma A.1 shows that the solution $q^*$ is equivalent to the posterior policy distribution $p_\pi(H_{1:T} \mid O_{1:T})$. This implies that for each state $s \in \mathcal{S}$ we have

$$q^*(a \mid s) \propto \pi(a \mid s) \exp Q^\pi_{\text{soft}}(s, a).$$

The exponential over $Q^\pi_{\text{soft}}(s, a)$ in the posterior policy $q^*$ interpolates $\pi$ towards the greedy policy by shifting probability density to actions with larger expected cumulative reward. Thus, $q^*$ provides a policy improvement over $\pi$.

### A.3. Proof of Theorem 2.2

*Proof of Theorem 2.2.*
Lemma A.1 shows how the posterior policy coincides with the optimal policy $q^*$ in a regularized Markov decision process (MDP). Lemma A.2 then describes that this gives a policy improvement in the unregularized MDP. Iterating this process (e.g., an expectation-maximization loop) yields consecutive improvements to the prior, $\pi^{(n)} \leftarrow q^{*(n-1)}$, and guarantees a locally optimal $\pi^*$ in the unregularized MDP as $n \to \infty$, which also implies $\mathrm{KL}(q^{*(n)} \,\|\, \pi^{(n-1)}) \to 0$.

### A.4. Importance Sampling Weights

For completeness, we give a detailed derivation of the result presented in Corollary 2.3. This derivation differs from the one given by Piché et al. (2019) in their Appendix A.4. Our derivation corrects for the fact that we do not need to compute an expectation over the transition function for $\exp V^\pi_{\text{soft}}(S_t)$ in the denominator. However, this is only a practical difference (i.e., in how the algorithm is implemented) that justifies our calculation.

**Corollary A.3** (Restated Corollary 2.3). Assuming access to the transition model $p(S_{t+1} \mid S_t, A_t)$, we obtain the importance sampling weights for $p_\pi(H_{1:t} \mid O_{1:T}) / p_q(H_{1:t})$,

$$w_t = w_{t-1} \cdot \frac{\pi(A_t \mid S_t)}{q(A_t \mid S_t)} \cdot \frac{\exp(R_t)\, \mathbb{E}\big[\exp V^\pi_{\text{soft}}(S_{t+1})\big]}{\exp V^\pi_{\text{soft}}(S_t)}.$$

*Proof.* For any statistic $f(\cdot)$, we have

$$\mathbb{E}_{p_\pi(H_{1:t} \mid O_{1:T})}\big[f(H_{1:t})\big] = \mathbb{E}_{p_q(H_{1:t})}\big[w_t \, f(H_{1:t})\big] = \mathbb{E}_{p_q(H_{1:t})}\left[\frac{p_\pi(H_{1:t} \mid O_{1:T})}{p_q(H_{1:t})} f(H_{1:t})\right] = \mathbb{E}_{p_q(H_{1:t})}\Big[\frac{p_\pi(S_t, A_t \mid H$$
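In log-space, one step of the weight recursion stated in Corollary A.3 is a simple sum. The sketch below is only a schematic reading of that formula: `log_E_exp_v_next` stands in for $\ln \mathbb{E}[\exp V^\pi_{\text{soft}}(S_{t+1})]$ (e.g., a one-sample estimate), and the function name and arguments are hypothetical.

```python
import numpy as np

def log_weight_update(log_w, log_pi, log_q, reward, log_E_exp_v_next, v_soft):
    """log w_t = log w_{t-1} + log(pi/q) + R_t
               + ln E[exp V_soft(S_{t+1})] - V_soft(S_t)."""
    return log_w + (log_pi - log_q) + reward + log_E_exp_v_next - v_soft
```

Accumulating the recursion in log-space avoids the numerical overflow and underflow that multiplying exponentiated values and rewards directly would cause.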