# Trust-Region Twisted Policy Improvement

Joery A. de Vries¹, Jinke He¹, Yaniv Oren¹, Matthijs T. J. Spaan¹

**Abstract.** Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.

## 1. Introduction

Monte-Carlo tree search (MCTS) with neural networks (Browne et al., 2012) has enabled many recent successes in sequential decision-making problems such as board games (Silver et al., 2018), Atari (Ye et al., 2021), and algorithm discovery (Fawzi et al., 2022; Mankowitz et al., 2023). These successes demonstrate that combining decision-time planning and reinforcement learning (RL) often outperforms methods that utilize search or deep neural networks in isolation. Naturally, this has stimulated many studies to understand the role of planning in combination with learning (Guez et al., 2019; de Vries et al., 2021; Hamrick et al., 2021; Bertsekas, 2022; He et al., 2024), along with studies to improve specific aspects of the base algorithm (Hubert et al., 2021; Danihelka et al., 2022; Antonoglou et al., 2022; Oren et al., 2025).

¹Delft University of Technology, Delft, the Netherlands. Correspondence to: Joery A. de Vries.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Although MCTS has been a leading technology in recent breakthroughs, the tree search is inherently sequential, can deteriorate agent performance at small planning budgets (Grill et al., 2020), and requires significant modifications for general use in RL (Hubert et al., 2021). The sequential nature of MCTS is particularly crippling, as it limits the full utilization of modern hardware such as GPUs. Despite follow-up work attempting to address specific issues, alternative planners have since been explored that avoid these flaws. Successful alternatives in this area are often inspired by stochastic control (Del Moral, 2004; Åström, 2006); examples include path integral control (Theodorou et al., 2010; Williams et al., 2015; Hansen et al., 2022) and related sequential Monte-Carlo (SMC) methods (Naesseth et al., 2019; Chopin & Papaspiliopoulos, 2020). Specifically, recent variational SMC planners (Naesseth et al., 2018; Macfarlane et al., 2024) have shown great potential in terms of generality, performance, and scalability to parallel compute. These methods adopt a particle filter for trajectory smoothing to enable planning in RL (Piché et al., 2019). However, the distribution of interest for these particle filters does not perfectly align with learning and exploration for RL agents. Namely, recent SMC methods focus on estimating the trajectory distribution under an unknown policy, and not the actual unknown policy at the state where we initiate the planner. We find that this mismatch can cause SMC planners to suffer from unnecessarily high-variance estimation and waste much of their compute and data during planning.
In other words, online planning in RL should serve as a local approximate policy improvement (Chan et al., 2022; Sutton & Barto, 2018). Fortunately, the existing MCTS and SMC literature provides various directions to achieve this (Moral et al., 2010; Svensson et al., 2015; Lawson et al., 2018; Danihelka et al., 2022; Grill et al., 2020), but their use in SMC planning remains largely unrealized.

This paper aims to address the current limitations of bootstrapped particle filter planners for RL by drawing inspiration from MCTS. Our contributions are 1) making more accurate estimates of the statistics extracted by the planner at the start of planning, and 2) enhancing data-efficiency inside the planner. We address the pervasive path-degeneracy problem in SMC by backing up accumulated reward and value data to perform policy inference and construct value targets. Next, to reduce the variance of estimated statistics due to resampling, we use exponential twisting functions to improve the sampling of trajectories inside the planner. We also impose adaptive trust-region constraints relative to a prior policy to control the bias-variance trade-off in sampling proposal trajectories. Finally, we modify the resampling method (Naesseth et al., 2019) to correct particles that become permanently stuck in absorbing states due to termination in the baseline SMC. We dub our new method Trust-Region Twisted SMC and demonstrate improved sample-efficiency and runtime scaling over SMC and MCTS baselines in discrete and continuous domains.

## 2. Background

We want to find an optimal policy $\pi^*$ for a sequential decision-making problem, which we formalize as an infinite-horizon Markov decision process (MDP) (Sutton & Barto, 2018).
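The terminal-state handling sketched in the contributions above can be illustrated with a toy resampling step. The snippet below is not the paper's implementation; it is a minimal sketch (with hypothetical names) of one plausible way to keep particles that sit in absorbing states from being cloned during systematic resampling.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: one shared uniform offset per sweep."""
    n = len(weights)
    cum = np.cumsum(weights)
    cum[-1] = 1.0  # guard against floating-point round-off
    positions = (rng.uniform() + np.arange(n)) / n
    return np.searchsorted(cum, positions)

def smc_resample_step(log_weights, terminated, rng):
    """One resampling step that assigns zero weight to terminated
    particles so they are never duplicated; if every particle has
    terminated, the particle set is returned unchanged."""
    w = np.exp(log_weights - log_weights.max())
    w = np.where(terminated, 0.0, w)      # mask absorbed particles
    if w.sum() == 0.0:                    # all terminated: keep as-is
        return np.arange(len(w))
    return systematic_resample(w / w.sum(), rng)
```

Zeroing the weight of absorbed particles before normalization means the cumulative-sum search can never land on them, which is one simple way to avoid wasting particle budget on states that produce no further rewards.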
We define states $S \in \mathcal{S}$, actions $A \in \mathcal{A}$, and rewards $R \in \mathbb{R}$ as random variables, where we write $H_{1:T} = \{S_t, A_t\}_{t=1}^{T}$ as the joint random variable of a finite sequence,

$$p_\pi(H_{1:T}) = \prod_{t=1}^{T} \pi(A_t \mid S_t)\, p(S_t \mid S_{t-1}, A_{t-1}), \tag{1}$$

where $p(S_1 \mid A_0, S_0) = p(S_1)$ is the initial state distribution, $p(S_{t+1} \mid S_t, A_t)$ is the transition model, and $\pi(A_t \mid S_t)$ is the policy. We denote the set of admissible policies as $\Pi = \{\pi \mid \pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})\}$; our aim is to find a parametric $\pi_\theta \in \Pi$ (e.g., a neural network) such that $\mathbb{E}_{p_{\pi_\theta}(H_{1:T})}\big[\sum_{t=1}^{T} R_t\big]$ is maximized, where we abbreviate $R_t = R(S_t, A_t)$ for the rewards. For convenience, we subsume the discount factor $\gamma \in [0, 1]$ into the transition probabilities as a termination probability $p_{\text{term}} = 1 - \gamma$, assuming that the MDP always ends up in an absorbing state $S_T$ with zero rewards.

### 2.1. Control as Inference

The reinforcement learning problem can be recast as a probabilistic inference problem through the control-as-inference framework (Levine, 2018). This reformulation has led to successful algorithms like MPO (Abdolmaleki et al., 2018) that naturally allow regularized policy iteration (Geist et al., 2019), which is highly effective in practice with neural network approximation. Additionally, it enables us to directly use tools from Bayesian inference on our graphical model. To formalize this, the distribution for $H_{1:T}$ can be conditioned on a binary outcome variable $O_t \in \{0, 1\}$; then, given a likelihood $p(O_{1:T} = 1 \mid H_{1:T}) = p(O_{1:T} \mid H_{1:T})$, this gives rise to a posterior distribution $p(H_{1:T} \mid O_{1:T})$. Typically, we use the exponentiated sum of rewards for the (unnormalized) likelihood $p(O_{1:T} \mid H_{1:T}) \propto \prod_{t=1}^{T} \exp R_t$.

**Definition 2.1.** The posterior factorizes as

$$p_\pi(H_{1:T} \mid O_{1:T}) = \prod_{t=1}^{T} p_\pi(A_t \mid S_t, O_{t:T})\, p(S_t \mid S_{t-1}, A_{t-1}),$$

assuming that $S_{t+1} \perp O_{1:T} \mid S_t, A_t$ and $A_t \perp O_{1:t-1} \mid S_t$.

For the derivations in Appendix A, we additionally assume absolute continuity of the proposal, i.e., $p_\pi(H) > 0 \Rightarrow p_q(H) > 0$ almost everywhere.
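The exponentiated-reward likelihood above already suggests a Monte-Carlo scheme: trajectories sampled from the prior $p_\pi(H)$ can be reweighted by their exponentiated return to target the posterior $p(H_{1:T} \mid O_{1:T})$. The sketch below (illustrative, not the paper's planner) shows this self-normalized importance weighting.

```python
import numpy as np

def posterior_trajectory_weights(returns):
    """Self-normalized weights for p(H | O) ∝ p_pi(H) exp(sum_t R_t),
    given the returns of trajectories sampled from the prior p_pi(H).
    Subtracting the max is the usual log-sum-exp stabilization."""
    log_w = np.asarray(returns, dtype=float)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```

Trajectories with higher return receive exponentially more posterior mass; this concentration on a few trajectories is precisely why naive trajectory-level weighting is high-variance and why SMC methods resample and twist the proposal instead.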
The log-likelihood of the outcome variable decomposes as

$$
\begin{aligned}
\ln p_\pi(O) &= \ln \frac{p_\pi(O, H)}{p_\pi(H \mid O)} = \ln \frac{p_\pi(O, H)}{p_q(H)} + \ln \frac{p_q(H)}{p_\pi(H \mid O)} \\
&= \mathbb{E}_{p_q(H)}\left[\ln \frac{p_\pi(O, H)}{p_q(H)}\right] + \mathbb{E}_{p_q(H)}\left[\ln \frac{p_q(H)}{p_\pi(H \mid O)}\right] \\
&= \mathbb{E}_{p_q(H)}\left[\ln p(O \mid H) + \ln \frac{p_\pi(H)}{p_q(H)}\right] + \mathrm{KL}(p_q(H) \,\|\, p_\pi(H \mid O)) \\
&= \mathbb{E}_{p_q(H)}\left[\ln p(O \mid H)\right] - \mathrm{KL}(p_q(H) \,\|\, p_\pi(H)) + \mathrm{KL}(p_q(H) \,\|\, p_\pi(H \mid O)) \\
&= \mathbb{E}_{p_q(H)}\left[\sum_t R_t - \mathrm{KL}(q(\cdot \mid S_t) \,\|\, \pi(\cdot \mid S_t))\right] + \mathrm{KL}(p_q(H) \,\|\, p_\pi(H \mid O)),
\end{aligned}
$$

where in the last step the transition terms of $p_\pi(H)$ and $p_q(H)$ cancel out. See the work by Levine (2018) for comparison.

The result from Lemma A.1 shows that the log-likelihood decomposes into an evidence lower bound and an evidence gap. This inequality becomes tight when $q$ is simply set to the posterior policy, $p_{q^*}(H) = p_\pi(H \mid O)$, thus motivating the maximization of the lower bound as given in Theorem 2.2, or equivalently, the minimization of the evidence gap as considered by Levine (2018).

### A.2. Regularized Policy Improvement

The result below gives a brief sketch that the expectation-maximization loop (Neal & Hinton, 1998) generates a sequence of regularized Markov decision processes (MDPs) that eventually converges to a locally optimal policy. This result is a nice consequence of the control-as-inference framework. A more general discussion outside of the expectation-maximization framework can be found in the work by Geist et al. (2019).

**Lemma A.2** (Regularized Policy Improvement). The solution $q^*$ to the problem

$$\max_{q \in \Pi} \; \mathbb{E}_{p_q(H)}\left[\sum_{t=1}^{T} R_t - \mathrm{KL}(q(\cdot \mid S_t) \,\|\, \pi(\cdot \mid S_t))\right]$$

guarantees a policy improvement in the unregularized MDP,

$$\mathbb{E}_{p_{q^*}(H)}\Big[\sum_t R_t\Big] \geq \mathbb{E}_{p_\pi(H)}\Big[\sum_t R_t\Big].$$

*Proof.* Lemma A.1 shows that the solution $q^*$ is equivalent to the posterior policy distribution $p_\pi(H_{1:T} \mid O_{1:T})$. This implies that for each state $s \in \mathcal{S}$ we have

$$q^*(a \mid s) \propto \pi(a \mid s) \exp Q^\pi_{\text{soft}}(s, a).$$

The exponential over $Q^\pi_{\text{soft}}(s, a)$ in the posterior policy $q^*$ interpolates $\pi$ towards the greedy policy by shifting probability density to actions with larger expected cumulative reward. Thus, $q^*$ provides a policy improvement over $\pi$.

### A.3. Proof of Theorem 2.2

*Proof of Theorem 2.2.*
Lemma A.1 shows how the posterior policy coincides with the optimal policy $q^*$ in a regularized Markov decision process (MDP). Lemma A.2 then describes that this gives a policy improvement in the unregularized MDP. Iterating this process (e.g., an expectation-maximization loop) yields consecutive improvements to the prior, $\pi^{(n)} \leftarrow q^{*(n-1)}$, and guarantees a locally optimal $\pi^*$ in the unregularized MDP as $n \to \infty$, which also implies $\mathrm{KL}(q^{*(n)} \,\|\, \pi^{(n-1)}) \to 0$.

### A.4. Importance Sampling Weights

For completeness, we give a detailed derivation of the result presented in Corollary 2.3. This derivation differs from the one given by Piché et al. (2019) in their Appendix A.4. Our derivation corrects for the fact that we do not need to compute an expectation over the transition function for $\exp V^\pi_{\text{soft}}(S_t)$ in the denominator. However, this is only a practical difference (i.e., in how the algorithm is implemented) that justifies our calculation.

**Corollary A.3** (Restated Corollary 2.3). Assuming access to the transition model $p(S_{t+1} \mid S_t, A_t)$, we obtain the importance sampling weights for $p_\pi(H_{1:t} \mid O_{1:T}) / p_q(H_{1:t})$,

$$w_t = w_{t-1} \cdot \frac{\pi(A_t \mid S_t)}{q(A_t \mid S_t)} \cdot \frac{\exp(R_t)\, \mathbb{E}\big[\exp V^\pi_{\text{soft}}(S_{t+1})\big]}{\exp V^\pi_{\text{soft}}(S_t)}.$$

*Proof.* For any statistic $f(\cdot)$, we have

$$\mathbb{E}_{p_\pi(H_{1:t} \mid O_{1:T})}\big[f(H_{1:t})\big] = \mathbb{E}_{p_q(H_{1:t})}\big[w_t \, f(H_{1:t})\big] = \mathbb{E}_{p_q(H_{1:t})}\left[\frac{p_\pi(H_{1:t} \mid O_{1:T})}{p_q(H_{1:t})} f(H_{1:t})\right] = \mathbb{E}_{p_q(H_{1:t})}\Big[\frac{p_\pi(S_t, A_t \mid H$$
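In log-space, one step of the weight recursion stated in Corollary A.3 is a simple sum. The sketch below is only a schematic reading of that formula: `log_E_exp_v_next` stands in for $\ln \mathbb{E}[\exp V^\pi_{\text{soft}}(S_{t+1})]$ (e.g., a one-sample estimate), and the function name and arguments are hypothetical.

```python
import numpy as np

def log_weight_update(log_w, log_pi, log_q, reward, log_E_exp_v_next, v_soft):
    """log w_t = log w_{t-1} + log(pi/q) + R_t
               + ln E[exp V_soft(S_{t+1})] - V_soft(S_t)."""
    return log_w + (log_pi - log_q) + reward + log_E_exp_v_next - v_soft
```

Accumulating the recursion in log-space avoids the numerical overflow and underflow that multiplying exponentiated values and rewards directly would cause.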