# policy_gradient_with_tree_expansion__cbef138b.pdf Policy Gradient with Tree Expansion Gal Dalal * 1 Assaf Hallak * 1 Gugan Thoppe 2 Shie Mannor 1 3 Gal Chechik 1 4 Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce Soft Tree Max a generalization of softmax that employs planning. In Soft Tree Max, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze Soft Tree Max and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of Soft Tree Max, we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that Soft Tree Max reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO. 1. Introduction Policy Gradient (PG) methods (Sutton et al., 1999) for Reinforcement Learning (RL) are often the first choice for environments that allow numerous interactions at a fast pace (Schulman et al., 2017). Their success is attributed to several factors: they are easy to distribute to multiple workers, require no assumptions on the underlying value function, and have both on-policy and off-policy variants. *Equal contribution 1NVIDIA Research 2Indian Institute of Science 3Technion University 4Bar-Ilan University. Correspondence to: Gal dalal , Assaf Hallak . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). Despite these positive features, PG algorithms are also notoriously unstable due to the high variance of the gradients computed over entire trajectories (Liu et al., 2020; Xu et al., 2020). As a result, PG algorithms tend to be highly inefficient in terms of sample complexity. Several solutions have been proposed to mitigate the high variance issue, including baseline subtraction (Greensmith et al., 2004; Thomas & Brunskill, 2017; Wu et al., 2018), anchor-point averaging (Papini et al., 2018), and other variance reduction techniques (Zhang et al., 2021; Shen et al., 2019; Pham et al., 2020). A second family of algorithms that achieved state-of-the-art results in several domains is based on planning. Planning is exercised primarily in the context of value-based RL and is usually implemented using a Tree Search (TS) (Silver et al., 2016; Schrittwieser et al., 2020). In this work, we combine PG with TS by introducing a parameterized differentiable policy that incorporates tree expansion. Namely, our Soft Tree Max policy replaces the standard policy logits of a state and action, with the expected value of trajectories that originate from these state and action. We consider two variants of Soft Tree Max, one for cumulative reward and one for exponentiated reward. Unlike approaches that incorporate multi-step returns for value function estimation (e.g., n-step TD methods), our work explicitly integrates planning into the policy parameterization itself. This distinction is crucial while n-step returns serve as a Monte Carlo estimation technique for the advantage function A in the gradient estimator E[( log π)A], our Soft Tree Max affects the policy π directly. This enables us to obtain fundamentally different variance reduction properties while keeping the policy gradient framework intact. Combining TS and PG should be done with care given the biggest downside of PG its high gradient variance. This raises questions that were ignored until this work: (i) How to design a PG method based on tree-expansion that is stable and performs well in practice? and (ii) How does the tree-expansion policy affect the PG variance? Here, we analyze Soft Tree Max, and provide a practical methodology to choose the expansion policy to minimize the resulting variance. Our main result shows that a desirable expansion policy is one, under which the induced transition probabilities are similar for each starting state. More generally, we show that the gradient variance of Soft Tree Max decays at a Policy Gradient with Tree Expansion rate of |λ2|d, where d is the depth of the tree and λ2 is the second eigenvalue of the transition matrix induced by the tree expansion policy. This work is the first to prove such a relation between PG variance and tree expansion policy. In addition, we prove that with an approximate forward model, the bias of the gradient is bounded proportionally to the approximation error of the model. To verify our results, we implemented a practical version of Soft Tree Max that exhaustively searches the entire tree and applies a neural network on its leaves. We test our algorithm on a parallelized Atari GPU simulator (Dalton et al., 2020). To enable a tractable deep search, up to depth eight, we also introduce a pruning technique that limits the width of the tree. We do so by sampling only the most promising nodes at each level. We integrate our Soft Tree Max GPU implementation into the popular PPO (Schulman et al., 2017) and compare it to the flat distributed variant of PPO. This allows us to demonstrate the potential benefit of utilizing learned models while isolating the fundamental properties of TS without added noise. In all tested Atari games, our results outperform the baseline and obtain up to 5x more reward. We further show in Section 6 that the associated gradient variance is smaller by three orders of magnitude in all games, demonstrating the relation between low gradient variance and high reward. We summarize our key contributions. (i) We show how to combine two families of So TA approaches: PG and TS by introducing Soft Tree Max: a novel parametric policy that generalizes softmax to planning. Specifically, we propose two variants based on cumulative and exponentiated rewards. (ii) We prove that the gradient variance of Soft Tree Max in its two variants decays with its tree depth. Our analysis sheds new light on the choice of tree expansion policy. It raises the question of optimality in terms of variance versus the traditional regret; e.g., in UCT (Kocsis & Szepesv ari, 2006). (iii) We prove that with an approximate forward model, the gradient bias is proportional to the approximation error, while retaining the variance decay. This quantifies the accuracy required from a learned forward model. (iv) We implement a differentiable deep version of Soft Tree Max that employs a parallelized GPU tree expansion. We demonstrate how its gradient variance is reduced by three orders of magnitude over PPO while obtaining up to 5x reward. 2. Preliminaries Let U denote simplex over the set U. Throughout, we consider a discounted Markov Decision Process (MDP) M = (S, A, P, r, γ, ν), where S is a finite state space of size S, A is a finite action space of size A, r : S A [0, 1] is the reward function, P : S A S is the transition function, γ (0, 1) is the discount factor, and ν RS is the initial state distribution. We denote the transition matrix starting from state s by Ps [0, 1]A S, i.e., [Ps]a,s = P(s |a, s). Similarly, let Rs = r(s, ) RA denote the corresponding reward vector. Separately, let π : S A be a stationary policy. Let P π and Rπ be the induced transition matrix and reward function, respectively, i.e., P π(s |s) = P a π(a|s) Pr(s |s, a) and Rπ(s) = P a π(a|s)r(s, a). Denote the stationary distribution of P π by µπ RS s.t. µ π P π = P π, and the discounted state visitation frequency by dπ so that d π = (1 γ) P t=0 γtν (P π)t. Also, let V π RS be the value function of π defined by V π(s) = Eπ [P t=0 γtr (st, π(st)) | s0 = s], and let Qπ RS A be the Q-function such that Qπ(s, a) = Eπ [r(s, a) + γV π(s )]. Our goal is to find an optimal policy π such that V (s) V π (s) = maxπ V π(s), s S. For the analysis in Section 4, we introduce the following notation. Denote by Θ RS the vector representation of θ(s) s S. For a vector u, denote by exp(u) the coordinate-wise exponent of u and by D(u) the diagonal square matrix with u in its diagonal. For a matrix A, denote its i-th eigenvalue by λi(A). Denote the k-dimensional identity matrix and all-ones vector by Ik and 1k, respectively. Also, denote the trace operator by Tr . Finally, we treat all vectors as column vectors. 2.1. Policy Gradient PG schemes seek to maximize the cumulative reward as a function of the policy πθ(a|s) by performing gradient steps on θ. The celebrated Policy Gradient Theorem (Sutton et al., 1999) states that θν V πθ = Es dπθ ,a πθ( |s) [ θ log πθ(a|s)Qπθ(s, a)] , where ν and d πθ are as defined above. The variance of the gradient is thus Vars dπθ ,a πθ( |s) ( θ log πθ(a|s)Qπθ(s, a)) . (1) In the notation above, we denote the variance of a vector random variable X by Varx (X) = Tr h Ex h (X Ex X) (X Ex X) ii , similarly as in (Greensmith et al., 2004). From now on, we drop the subscript from Var in (1) for brevity. When the action space is discrete, a commonly used parameterized policy is softmax: πθ(a|s) exp (θ(s, a)) , where θ : S A R is a state-action parameterization. 3. Soft Tree Max: Exponent of trajectories We introduce a new family of policies called Soft Tree Max, which are a model-based generalization of the popular softmax. We propose two variants: Cumulative Policy Gradient with Tree Expansion (C-Soft Tree Max) and Exponentiated (E-Soft Tree Max). In both variants, we replace the generic softmax logits θ(s, a) with the score of a trajectory of horizon d starting from (s, a), generated by applying a behavior policy πb. In C-Soft Tree Max, we exponentiate the expectation of the logits. In E-Soft Tree Max, we first exponentiate the logits and then only compute their expectation. Logits. We define the Soft Tree Max logit ℓs,a(d; θ) to be the random variable depicting the score of a trajectory of horizon d starting from (s, a) and following the policy πb: ℓs,a(d; θ) = γ d "d 1 X t=0 γtrt + γdθ(sd) In the above expression, note that s0 = s, a0 = a, at πb( |st) t 1, and rt r (st, at) . For brevity of the analysis, we let the parametric score θ in (2) be state-based, similarly to a value function. Instead, one could use a stateaction input analogous to a Q-function. Thus, Soft Tree Max can be integrated into the two types of implementation of RL algorithms in standard packages. Lastly, the preceding γ d scales the θ parametrization to correspond to its softmax counerpart. C-Soft Tree Max. Given an inverse temperature parameter β, we let C-Soft Tree Max be πC d,θ(a|s) exp [βEπbℓs,a(d; θ)] . (3) C-Soft Tree Max gives higher weight to actions that result in higher expected returns. While standard softmax relies entirely on parametrization θ, C-Soft Tree Max also interpolates a Monte-Carlo portion of the reward. E-Soft Tree Max. The second operator we propose is E-Soft Tree Max: πE d,θ(a|s) Eπb exp [(βℓs,a(d; θ))] ; (4) here, the expectation is taken outside the exponent. This objective corresponds to the exponentiated reward objective which is often used for risk-sensitive RL (Howard & Matheson, 1972; Fei et al., 2021; Noorani & Baras, 2021). The common risk-sensitive objective is of the form log E[exp(δR)], where δ is the risk parameter and R is the cumulative reward. Similarly to that literature, the exponent in (4) emphasizes the most promising trajectories. Soft Tree Max properties. Soft Tree Max is a natural modelbased generalization of softmax. For d = 0, both variants above coincide since (2) becomes deterministic. In that case, for a state-action parametrization, they reduce to standard softmax. When β 0, both variants again coincide and sample actions uniformly (exploration). When β , the policies become deterministic and greedily optimize for the best trajectory (exploitation). For C-Soft Tree Max, the best trajectory is defined in expectation, while for E-Soft Tree Max it is defined in terms of the best sample path. Soft Tree Max behavior policy selection. The choice of behavior policy πb plays a crucial role in the performance of Soft Tree Max. As we show in Section 4, the gradient variance is minimized when the transitions induced by πb result in similar distributions across states, which is achieved when the second eigenvalue of P πb is small. Without specific assumptions on the MDP, a uniform policy that smoothens transition probabilities across all actions serves as a reasonable approximation to this goal. In practice, this leads to better mixing properties in the associated Markov chain. Soft Tree Max convergence. Under regularity conditions, for any parametric policy, PG converges to local optima (Bhatnagar et al., 2009), and thus also Soft Tree Max. For softmax PG, asymptotic (Agarwal et al., 2021) and rate results (Mei et al., 2020b) were recently obtained, by showing that the gradient is strictly positive everywhere (Mei et al., 2020b, Lemmas 8-9). We conjecture that Soft Tree Max satisfies the same property, being a generalization of softmax, but formally proving it is subject to future work. Soft Tree Max gradient. The two variants of Soft Tree Max involve an expectation taken over Sd many trajectories from the root state s and weighted according to their probability. Thus, during the PG training process, the gradient θ log πθ is calculated using a weighted sum of gradients over all reachable states starting from s. Our method exploits the exponential number of trajectories to reduce the variance while improving performance. Indeed, in the next section we prove that the gradient variance of Soft Tree Max decays exponentially fast as a function of the behavior policy πb and trajectory length d. In the experiments in Section 6, we also show how the practical version of Soft Tree Max achieves a significant reduction in the noise of the PG process and leads to faster convergence and higher reward. 4. Theoretical Analysis In this section, we first bound the variance of PG when using the Soft Tree Max policy. Later, we discuss how the gradient bias resulting due to approximate forward models diminishes as a function of the approximation error, while retaining the same variance decay. We show that the variance decreases with the tree depth, and the rate is determined by the second eigenvalue of the transition kernel induced by πb. Specifically, we bound the same expression for variance as appears in (Greensmith et al., 2004, Sec. 3.5) and (Wu et al., 2018, Sec. A, Eq. (21)). Other types of analysis could instead have focused on the estimation aspect in the context of sampling (Zhang et al., 2021; Shen et al., 2019; Pham et al., 2020). Indeed, in Policy Gradient with Tree Expansion our implementation in Section 5, we manage to avoid sampling and directly compute the expectations in Eqs. (3) and (4). As we show later, we do so by leveraging efficient parallel simulation on the GPU in feasible run-time. In our application, due to the nature of the finite action space and quasi-deterministic Atari dynamics (Bellemare et al., 2013), our expectation estimator is noiseless. We encourage future work to account for the finite-sample variance component. We defer all the proofs to Appendix A. We begin with a general variance bound that holds for any parametric policy. Lemma 4.1 (Bound on the policy gradient variance). Let θ log πθ( |s) RA dim(θ) be a matrix whose a-th row is θ log πθ(a|s) . For any parametric policy πθ and function Qπθ : S A R, Var ( θ log πθ(a|s)Qπθ(s, a)) max s,a [Qπθ(s, a)]2 max s θ log πθ( |s) 2 F . Hence, to bound (1), it is sufficient to bound the Frobenius norm θ log πθ( |s) F for any s. Note that Soft Tree Max does not reduce the gradient uniformly, which would have been equivalent to a trivial change in the learning rate. While the gradient norm shrinks, the gradient itself scales differently along the different coordinates. This scaling occurs along different eigenvectors, as a function of problem parameters (P, θ) and our choice of behavior policy (πb), as can be seen in the proof of the upcoming Theorem 4.4. This allows Soft Tree Max to learn a good shrinkage that, while reducing the overall gradient, still updates the policy quickly enough. This reduction in norm and variance resembles the idea of gradient clipping (Zhang et al., 2019), where the gradient is scaled to reduce its variance, thus increasing stability and improving overall performance. A common assumption in the RL literature (Szepesv ari, 2010) that we adopt for the remainder of the section is that the transition matrix P πb, induced by the behavior policy πb, is irreducible and aperiodic. Consequently, its second highest eigenvalue satisfies |λ2(P πb)| < 1. From now on, we divide the variance results for the two variants of Soft Tree Max into two subsections. For C-Soft Tree Max, the analysis is simpler and we provide an exact bound. The case of E-Soft Tree Max is more involved and we provide for it a more general result. In both cases, we show that the variance decays exponentially with the planning horizon. 4.1. Variance of C-Soft Tree Max We express C-Soft Tree Max in vector form as follows. Lemma 4.2 (Vector form of C-Soft Tree Max). For d 1, (3) is given by πC d,θ( |s) = exp h β Cs,d + Ps (P πb)d 1 Θ i 1 A exp h β Cs,d + Ps (P πb)d 1 Θ i, (5) Cs,d = γ d Rs + Ps h=1 γh d (P πb)h 1 # The vector Cs,d RA represents the cumulative discounted reward in expectation along the trajectory of horizon d. This trajectory starts at state s, involves an initial reward dictated by Rs and an initial transition as per Ps. Thereafter, it involves rewards and transitions specified by Rπb and P πb, respectively. Once the trajectory reaches depth d, the score function θ(sd) is applied,. Lemma 4.3 (Gradient of C-Soft Tree Max). The C-Soft Tree Max gradient is given by θ log πC d,θ = β IA 1A(πC d,θ) Ps (P πb)d 1 , in RA S, where for brevity, we drop the s index in the policy above, i.e., πC d,θ πC d,θ( |s). We are now ready to present our first main result: Theorem 4.4 (Variance decay of C-Soft Tree Max). For every Q : S A R, the C-Soft Tree Max policy gradient variance is bounded by Var θ log πC d,θ(a|s)Q(s, a) (1 γ)2 |λ2(P πb)|2(d 1). We provide the full proof in Appendix A.4, and briefly outline its essence here. Proof outline. Lemma 4.1 allows us to bound the variance using a direct bound on the gradient norm. The gradient is given in Lemma 4.3 as a product of three matrices, which we now study from right to left. The matrix P πb is a rowstochastic matrix. Because the associated Markov chain is irreducible and aperiodic, it has a unique stationary distribution. This implies that P πb has one and only one eigenvalue equal to 1; all others have magnitude strictly less than 1. Let us suppose that all these other eigenvalues have multiplicity 1 (the general case with repeated eigenvalues can be handled via Jordan decompositions as in (Pelletier, 1998, Lemma1)). Then, P πb has the spectral decomposition P πb = 1Sµ πb + PS i=2 λiviu i , where λi is the i-th Policy Gradient with Tree Expansion eigenvalue of P πb (ordered in descending order according to their magnitude) and ui and vi are the corresponding left and right eigenvectors, respectively, and therefore (P πb)d 1 = 1Sµ πb + PS i=2 λd 1 i viu i . The second matrix in the gradient relation in Lemma 4.3, Ps, is a rectangular transition matrix that translates the vector of all ones from dimension S to A : Ps1S = 1A. Lastly, the first matrix h IA 1A(πC d,θ) i is a projection whose null-space includes the vector 1A, i.e., h IA 1A(πC d,θ) i 1A = 0. Combining the three properties above when multiplying the three matrices of the gradient, it is easy to see that the first term in the expression for (P πb)d 1 gets canceled, and we are left with bounded summands scaled by λi(P πb)d 1. Recalling that |λi(P πb)| < 1 and that |λ2| |λ3| . . . for i = 2, . . . , S, we obtain the desired result. It s important to note that Soft Tree Max does not reduce the gradient uniformly, which would be equivalent to simply decreasing the learning rate. Rather, the gradient is scaled differently along different eigenvectors, with scaling factors that depend on the MDP structure and behavior policy πb. This non-uniform scaling allows the policy to continue learning effectively while reducing harmful variance. The empirical results in Section 6 demonstrate that this variance reduction leads to faster, more stable convergence rather than slowing it down. Theorem 4.4 guarantees that the variance of the gradient decays with d. More importantly, it also provides a novel insight for choosing the behavior policy πb as the policy that minimizes the absolute second eigenvalue of the P πb. Indeed, the second eigenvalue of a Markov chain relates to its connectivity and its rate of convergence to the stationary distribution (Levin & Peres, 2017). Optimal variance decay. For the strongest reduction in variance, the behavior policy πb should be chosen to achieve an induced Markov chain whose transitions are state-independent. In that case, P πb is a rank one matrix of the form 1Sµ πb, and λ2(P πb) = 0. Then, Var ( θ log πθ(a|s)Q(s, a)) = 0. Naturally, this can only be done for pathological MDPs; see Appendix C.1 for a more detailed discussion. Nevertheless, as we show in Section 5, we choose our tree expansion policy to reduce the variance as best as possible. Worst-case variance decay. In contrast, and somewhat surprisingly, when πb is chosen so that the dynamics is deterministic, there is no guarantee that it will decay exponentially fast. For example, if P πb is a permutation matrix, then λ2(P πb) = 1, and advancing the tree amounts to only updating the gradient of one state for every action, as in the basic softmax. 4.2. Variance of E-Soft Tree Max The proof of the variance bound for E-Soft Tree Max is similar to that of C-Soft Tree Max, but more involved. It also requires the assumption that the reward depends only on the state, i.e. r(s, a) r(s). This is indeed the case in most standard RL environments such as Atari and Mujoco. Lemma 4.5 (Vector form of E-Soft Tree Max). For d 1, (4) is given by πE d,θ( |s) = Es,d exp(βΘ) 1 AEs,d exp(βΘ), (6) D exp(βγh d R) P πb . The vector R above is the S-dimensional vector whose s-th coordinate is r(s). The matrix Es,d RA S has a similar role to Cs,d from (5), but it represents the exponentiated cumulative discounted reward. Accordingly, it is a product of d matrices as opposed to a sum. It captures the expected reward sequence starting from s and then iteratively following P πb. After d steps, we apply the score function on the last state as in (6). Lemma 4.6 (Gradient of E-Soft Tree Max). The E-Soft Tree Max gradient is given by θ log πE d,θ = β IA 1A(πE d,θ) D πE d,θ 1 Es,d D(exp(βΘ)) 1 AEs,d exp(βΘ) RA S, where for brevity, we drop the s index in the policy above, i.e., πE d,θ πE d,θ( |s). This gradient structure is harder to handle than that of C-Soft Tree Max in Lemma 4.3, but here we also can bound the decay of the variance nonetheless. Theorem 4.7 (Variance decay of E-Soft Tree Max). There exists α (0, 1) such that, Var θ log πE d,θ(a|s)Q(s, a) O β2α2d , for every Q. Further, if P πb is reversible or if the reward is constant, then α = |λ2(P πb)|. Theory versus Practice. We demonstrate the above result in simulation. We draw a random finite MDP, parameter vector Θ RS +, and behavior policy πb. We then empirically compute the PG variance of E-Soft Tree Max as given in (1) and compare it to |λ2(P πb)|d. We repeat this experiment three times for different P πb : (i) close to uniform, (ii) drawn Policy Gradient with Tree Expansion 2 4 6 8 10 Depth d Soft Tree Max Gradient variance Permutation: Empirical variance Permutation: Variance bound Random: Empirical variance Random: Variance bound Uniform: Empirical variance Uniform: Variance bound Figure 1. A comparison of the empirical PG variance and our bound for E-Soft Tree Max on randomly drawn MDPs. We present three cases for P πb : (i) close to uniform, (ii) drawn randomly, and (iii) close to a permutation matrix. This experiment verifies the optimal and worse-case rate decay cases. The variance bounds here are taken from Theorem 4.7 where we substitute α = |λ2(P πb)|. To account for the constants, we match the values for the first point in d = 1. randomly, and (iii) close to a permutation matrix. As seen in Figure 1, the empirical variance and our bound match almost identically. This also suggests that α = |λ2(P πb)| in the general case and not only when P πb is reversible or when the reward is constant. 4.3. Bias with an Approximate Forward Model The definition of the two Soft Tree Max variants involves the knowledge of the underlying environment, in particular the value of P and r. However, in practice, we often can only learn approximations of the dynamics from interactions, e.g., using NNs (Ha & Schmidhuber, 2018; Schrittwieser et al., 2020). Let ˆP and ˆr denote the approximate kernel and reward functions, respectively. In this section, we study the consequences of the approximation error on the C-Soft Tree Max gradient. Let ˆπC d,θ be the C-Soft Tree Max policy defined given the approximate forward model introduced above. That is, let ˆπC d,θ be defined exactly as in (5), but using ˆRs, ˆPs, ˆRπb and ˆP πb, instead of their unperturbed counterparts from Section 2. Then, the variance of the corresponding gradient again decays exponentially with a decay rate of λ2( ˆP πb). However, a gradient bias is introduced. In the following, we bound this bias in terms of the approximation error and other problem parameters. The proof is provided in Appendix A.9. Theorem 4.8. Let ϵ be the maximal model mis-specification, i.e., let max{ P ˆP , r ˆr } = ϵ. Then the policy gradient bias due to ˆπC d,θ satisfies θ ν V ˆπC d,θ = (7) O 1 (1 γ)2γd Sβ2dϵ . Proof outline. First, we prove that max{ Rs ˆRs , Ps ˆPs , Rπb ˆRπb , P πb ˆP πb } = O(ϵ). This follows from the fact that the differences above are suitable convex combinations of either the rows of P ˆP or r ˆr. We use the above observation along with the definitions of πC d,θ and ˆπC d,θ given in (5) to show that πC d,θ ˆπC d,θ = O(βdϵ). The proof for the latter builds upon two key facts: (a) (P πb)k ( ˆP πb)k Pk h=1 ˆP πb h 1 ˆP πb P πb pπb k h = O(kϵ) for any k 0, and (b) |ex 1| = O(x) as x 0. Next, we decompose the LHS of (7) to get i=1 ˆX1(s) ˆXi 1(s) Xi(s) ˆXi(s) Xi+1(s) X4(s), where X1(s) = dπC d,θ(s) R, X2(s) = ( θ log πC d,θ( |s)) RS A, X3(s) = D(πC d,θ( |s)) RA A, X4(s) = QπC d,θ(s, ) RA A, and ˆX1(s), . . . , ˆX4(s) are similarly defined with πC d,θ replaced by ˆπC d,θ. Then, we show that, for i = 1, . . . , 4, (i) Xi(s) ˆXi(s) = O(ϵ) and (ii) max{ Xi , ˆXi } is bounded by problem parameters. From this, the desired result follows. To the best of our knowledge, Theorem 4.8 is the first result that bounds the bias of the gradient of a parametric policy due to an approximate model. This theorem reveals an intriguing trade-off in Soft Tree Max: while the variance decays exponentially with tree depth d as O(|λ2(P πb)|d 1) according to Theorem 4.4, the gradient bias with an approximate model grows as O(dγ d). Since γ d grows exponentially with d, there exists an optimal tree depth that balances these opposing effects. The bias also depends on the temperature parameter β, with higher temperature (lower β) reducing bias at the expense of more exploratory policies. In the extreme case of β = 0, the policy becomes uniform with no bias. Additionally, the error scales linearly with d because the policy suffers from cumulative error as it relies on further-looking states in the approximate model. These results suggest that if the learned model is accurate enough, we can expect similar convergence properties for Policy Gradient with Tree Expansion C-Soft Tree Max as we would obtain with the true dynamics. In practice, one should adjust both d and β based on the estimated accuracy of the forward model to achieve the best performance. This is particularly important for practitioners implementing Soft Tree Max with learned forward models. 5. Soft Tree Max: Deep Parallel Implementation Following impressive successes of deep RL (Mnih et al., 2015; Silver et al., 2016), using deep NNs in RL is standard practice. Depending on the RL algorithm, a loss function is defined and gradients on the network weights can be calculated. In PG methods, the scoring function used in the softmax is commonly replaced by a neural network Wθ: πθ(a|s) exp (Wθ(s, a)) . Similarly, we implement Soft Tree Max by replacing θ(s) in (2) with a neural network Wθ(s). Although both variants of Soft Tree Max from Section 3 involve computing an expectation, this can be hard in general. One approach to handle it is with sampling, though these introduce estimation variance into the process. We leave the question of sample-based theory and algorithmic implementations for future work. Instead, in finite action space environments such as Atari, we compute the exact expectation in Soft Tree Max with an exhaustive TS of depth d. Despite the exponential computational cost of spanning the entire tree, recent advancements in parallel GPU-based simulation allow efficient expansion of all nodes at the same depth simultaneously (Dalal et al., 2021; Rosenberg et al., 2022). This is possible when a simulator is implemented on GPU (Dalton et al., 2020; Makoviychuk et al., 2021; Freeman et al., 2021), or when a forward model is learned (Kim et al., 2020; Ha & Schmidhuber, 2018). To reduce the complexity to be linear in depth, we apply tree pruning to a limited width in all levels. We do so by sub-sampling only the most promising branches at each level. Limiting the width drastically improves runtime, and enables respecting GPU memory limits, with only a small sacrifice in performance. To summarize, in the practical Soft Tree Max algorithm we perform an exhaustive tree expansion with pruning to obtain trajectories up to depth d. We expand the tree with equal weight to all actions, which corresponds to a uniform tree expansion policy πb. We apply a neural network on the leaf states, and accumulate the result with the rewards along each trajectory to obtain the logits in (2). Finally, we aggregate the results using C-Soft Tree Max. We leave experiments E-Soft Tree Max for future work on risk-averse RL. For a detailed illustration of our GPU-based tree expansion implementation, see Appendix B.2. In addition, we provide the psudeocode for C-Soft Tree Max in Algorithm 1 and its integration with PPO in Algorithm 2 in Appendix B.3. During training, the gradient propagates to the NN weights of Wθ. When the gradient θ log πd,θ is calculated at each time step, it updates Wθ for all leaf states, similarly to Siamese networks (Bertinetto et al., 2016). An illustration of the policy is given in Figure 2. 6. Experiments We conduct our experiments on multiple games from the Atari simulation suite (Bellemare et al., 2013). As a baseline, we train a PPO (Schulman et al., 2017) agent with 256 GPU workers in parallel (Dalton et al., 2020). For the tree expansion, we employ a GPU breadth-first as in (Dalal et al., 2021). We then train C-Soft Tree Max 1 for depths d = 1 . . . 8, with a single worker. For depths d 3, we limited the tree to a maximum width of 1024 nodes and pruned trajectories with low estimated weights. Since the distributed PPO baseline advances significantly faster in terms of environment steps, for a fair comparison, we ran all experiments for one week on the same machine. For more details see Appendix B. In Figure 3, we plot the reward and variance of Soft Tree Max for each game, as a function of depth. The dashed lines are the results for PPO. Each value is taken after convergence, i.e., the average over the last 20% of the run. The numbers represent the average over five seeds per game. The plot conveys three intriguing conclusions. First, in all games, Soft Tree Max achieves significantly higher reward than PPO. Its gradient variance is also orders of magnitude lower than that of PPO. Second, the reward and variance are negatively correlated and mirror each other in almost all games. This phenomenon demonstrates the necessity of reducing the variance of PG for improving performance. Lastly, each game has a different sweet spot in terms of optimal tree depth. Recall that we limit the run-time in all experiments to one week The deeper the tree, the slower each step and the run consists of less steps. This explains the non-monotone behavior as a function of depth. For a more thorough discussion on the sweet spot of different games, see Appendix B.5. 7. Related Work Reducing variance in PG estimates is essential for improving efficiency and stability. Approaches include baseline subtraction (Sutton et al., 1999; Weaver & Tao, 2001), actiondependent baselines (Wu et al., 2018), sub-sampling techniques like SVRPG (Papini et al., 2018), and natural policy 1We also experimented with E-Soft Tree Max and the results were almost identical. This is due to the quasi-deterministic nature of Atari, which causes the trajectory logits (2) to have almost no variability. We encourage future work on E-Soft Tree Max using probabilistic environments that are risk-sensitive. Policy Gradient with Tree Expansion ($) Policy network Figure 2. Soft Tree Max policy. Our exhaustive parallel tree expansion iterates on all actions at each state up to depth d (= 2 here). The leaf state of every trajectory is used as input to the policy network. The output is then added to the trajectory s cumulative reward as described in (2). I.e., instead of the standard softmax logits, we add the cumulative discounted reward to the policy network output. This policy is differentiable and can be easily integrated into any PG algorithm. In this work, we build on PPO and use its loss function to train the policy network. Figure 3. Reward and Gradient variance: GPU Soft Tree Max (single worker) vs PPO (256 GPU workers). The blue reward plots show the average of 50 evaluation episodes. The red variance plots show the average gradient variance of the corresponding training runs, averaged over five seeds. The dashed lines represent the same for PPO. Note that the variance y-axis is in log-scale. gradient (Kakade, 2001). Multi-step returns for value estimation, such as n-step TD (Sutton et al., 1998) and GAE (Schulman et al., 2015), primarily affect learning targets rather than policy parameterization. Our approach differs by integrating planning directly into the policy structure, making it fundamentally different from both variance reduction techniques and multi-step return methods. Softmax Operator. The softmax policy became a canonical part of PG to the point where theoretical results of PG focus specifically on it (Zhang et al., 2021; Mei et al., 2020b; Li et al., 2021; Ding et al., 2022). Even though we focus on a tree extension to the softmax policy, our methodology is general and can be easily applied to other discrete or continuous parameterized policies as in (Mei et al., 2020a; Miahi et al., 2021; Silva et al., 2019). It s important to distinguish between tree search and our approach of tree expansion. Traditional tree search methods like MCTS identify the best trajectory through a selection process, while our approach explores all possible trajectories up to a certain depth to compute an improved policy. Recent work by Efroni et al. (2018) showed that multi-step greedy policies improve convergence rates, and Protopapas & Barakat (2024) combined policy mirror descent with lookahead planning, though using different mechanisms than our Soft Tree Max policy. Parallel Environments. In this work we used accurate parallel models that are becoming more common with the increasing popularity of GPU-based simulation (Makoviychuk et al., 2021; Dalton et al., 2020; Freeman et al., 2021). Policy Gradient with Tree Expansion Alternatively, in relation to Theorem 4.8, one can rely on recent works that learn the underlying model (Ha & Schmidhuber, 2018; Schrittwieser et al., 2020) and use an approximation of the true dynamics. Risk Aversion. Previous work considered exponential utility functions for risk aversion (Chen et al., 2007; Garcıa & Fern andez, 2015; Fei et al., 2021). This utility function is the same as E-Soft Tree Max formulation from (4), but we have it directly in the policy instead of the objective. Reward-free RL. We showed that the gradient variance is minimized when the transitions induced by the behavior policy πb are uniform. This is expressed by the second eigenvalue of the transition matrix P πb. This notion of uniform exploration is common to the reward-free RL setup (Jin et al., 2020). Several such works also considered the second eigenvalue in their analysis (Liu & Brunskill, 2018; Tarbouriech & Lazaric, 2019). 8. Discussion and Future Work In this work, we introduced for the first time a differentiable parametric policy that combines TS with PG. We proved that Soft Tree Max is essentially a variance reduction technique and explained how to choose the expansion policy to minimize the gradient variance. It is an open question whether optimal variance reduction corresponds to the appealing regret properties the were put forward by UCT (Kocsis & Szepesv ari, 2006). We believe that this can be answered by analyzing the convergence rate of Soft Tree Max, relying on the bias and variance results we obtained here. As the learning process continues, the norm of the gradient and the variance both become smaller. On the face of it, one can ask if the gradient becomes small as fast as the variance or even faster can there be any meaningful learning? As we showed in the experiments, learning happens because the variance reduces fast enough (a variance of 0 represents deterministic learning, which is fastest). Our work can be extended to infinite action spaces, where the theoretical analysis would involve transition kernels rather than matrices while preserving the same nonexpansive operator properties. For implementation, the tree of continuous actions can be expanded by maintaining a parametric distribution over actions depending on θ, similar to a tree adaptation of MPPI (Williams et al., 2017) with a value function. This would significantly broaden the applicability of Soft Tree Max to important domains like robotics and continuous control. Further important extensions include learning the forward model from interactions (Ha & Schmidhuber, 2018; Schrittwieser et al., 2020) rather than using the true dynamics. Our analysis in Theorem 4.8 already provides theoretical guidance on the resulting gradient bias, but empirical validation with learned models would bridge the gap between model- based approaches like Mu Zero and policy gradient methods. Additionally, adapting the behavior policy πb over time as learning progresses could lead to more relevant exploration, though this requires careful management to maintain variance reduction benefits. Finally, the E-Soft Tree Max variant offers potential applications in risk-sensitive RL by naturally emphasizing exceptional trajectories through its reward exponentiation structure. Acknowledgements Gugan Thoppe s research is supported by the Walmart Centre for Tech Excellence at IISc, the Indo-French Centre for the Promotion of Advanced Research CEFIPRA (7102-1), DST SERB s Core Research Grant CRG/2021/008330, and the Pratiksha Trust Young Investigator Award. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Reproducibility and Limitations The code for our implementation is available at https://github.com/NVlabs/Soft Tree Max. We provide a docker file for setting up the environment and a README file with instructions on how to run both training and evaluation. The environment engine is an extension of Atari-Cu LE (Dalton et al., 2020), a CUDA-based Atari emulator that runs on GPU. Our usage of a GPU environment is both a novelty and a current limitation of our work. There are additional limitations to consider. First, our analysis focuses on MDPs with finite state and action spaces, and while we discuss theoretical extensions to continuous domains, practical implementations would require careful design choices to manage the sampling process efficiently. Second, while Soft Tree Max can be applied to any environment, it provides most benefit in quasi-deterministic environments where accurate forward planning is possible. In highly stochastic environments, the exponential growth of the tree width with depth would present challenges even with pruning. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. J. Mach. Learn. Res., 22(98):1 76, 2021. Policy Gradient with Tree Expansion Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253 279, 2013. Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., and Torr, P. H. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pp. 850 865. Springer, 2016. Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor critic algorithms. Automatica, 45(11): 2471 2482, 2009. Chatterjee, S. and Seneta, E. Towards consensus: Some convergence theorems on repeated averaging. Journal of Applied Probability, 14(1):89 97, 1977. Chen, X., Sim, M., Simchi-Levi, D., and Sun, P. Risk aversion in inventory management. Operations Research, 55(5):828 842, 2007. Dalal, G., Hallak, A., Dalton, S., Mannor, S., Chechik, G., et al. Improve agents without retraining: Parallel tree search with off-policy correction. Advances in Neural Information Processing Systems, 34:5518 5530, 2021. Dalton, S. et al. Accelerating reinforcement learning through gpu atari emulation. Advances in Neural Information Processing Systems, 33:19773 19782, 2020. Ding, Y., Zhang, J., and Lavaei, J. On the global optimum convergence of momentum-based policy gradient. In International Conference on Artificial Intelligence and Statistics, pp. 1910 1934. PMLR, 2022. Efroni, Y., Dalal, G., Scherrer, B., and Mannor, S. Beyond the one-step greedy approach in reinforcement learning. In International Conference on Machine Learning, pp. 1387 1396. PMLR, 2018. Fei, Y., Yang, Z., Chen, Y., and Wang, Z. Exponential bellman equation and improved regret bounds for risksensitive reinforcement learning. Advances in Neural Information Processing Systems, 34:20436 20446, 2021. Freeman, C. D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., and Bachem, O. Brax-a differentiable physics engine for large scale rigid body simulation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. Garcıa, J. and Fern andez, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437 1480, 2015. Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004. Ha, D. and Schmidhuber, J. World models. ar Xiv preprint ar Xiv:1803.10122, 2018. Howard, R. A. and Matheson, J. E. Risk-sensitive markov decision processes. Management science, 18(7):356 369, 1972. Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870 4879. PMLR, 2020. Kakade, S. M. A natural policy gradient. Advances in neural information processing systems, 14, 2001. Kim, S. W., Zhou, Y., Philion, J., Torralba, A., and Fidler, S. Learning to simulate dynamic environments with gamegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1231 1240, 2020. Kocsis, L. and Szepesv ari, C. Bandit based monte-carlo planning. In European conference on machine learning, pp. 282 293. Springer, 2006. Levin, D. A. and Peres, Y. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017. Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. Softmax policy gradient methods can take exponential time to converge. In Conference on Learning Theory, pp. 3107 3110. PMLR, 2021. Liu, Y. and Brunskill, E. When simple exploration is sample efficient: Identifying sufficient conditions for random exploration to yield pac rl algorithms. ar Xiv preprint ar Xiv:1805.09045, 2018. Liu, Y., Zhang, K., Basar, T., and Yin, W. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624 7636, 2020. Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al. Isaac gym: High performance gpu-based physics simulation for robot learning. ar Xiv preprint ar Xiv:2108.10470, 2021. Mathkar, A. S. and Borkar, V. S. Nonlinear gossip. SIAM Journal on Control and Optimization, 54(3):1535 1557, 2016. Mei, J., Xiao, C., Dai, B., Li, L., Szepesv ari, C., and Schuurmans, D. Escaping the gravitational pull of softmax. Advances in Neural Information Processing Systems, 33: 21130 21140, 2020a. Policy Gradient with Tree Expansion Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pp. 6820 6829. PMLR, 2020b. Miahi, E., Mac Queen, R., Ayoub, A., Masoumzadeh, A., and White, M. Resmax: An alternative soft-greedy operator for reinforcement learning. 2021. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. nature, 518(7540): 529 533, 2015. Noorani, E. and Baras, J. S. Risk-sensitive reinforce: A monte carlo policy gradient algorithm for exponential performance criteria. In 2021 60th IEEE Conference on Decision and Control (CDC), pp. 1522 1527. IEEE, 2021. Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In International conference on machine learning, pp. 4026 4035. PMLR, 2018. Pelletier, M. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic processes and their applications, 78(2):217 244, 1998. Pham, N., Nguyen, L., Phan, D., Nguyen, P. H., Dijk, M., and Tran-Dinh, Q. A hybrid stochastic policy gradient algorithm for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 374 385. PMLR, 2020. Protopapas, K. and Barakat, A. Policy mirror descent with lookahead. Advances in Neural Information Processing Systems, 37:26443 26481, 2024. Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., and Dormann, N. Stable baselines3, 2019. Rosenberg, A., Hallak, A., Mannor, S., Chechik, G., and Dalal, G. Planning and learning with adaptive lookahead. ar Xiv preprint ar Xiv:2201.12403, 2022. Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839): 604 609, 2020. Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. ar Xiv preprint ar Xiv:1506.02438, 2015. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. Hessian aided policy gradient. In International conference on machine learning, pp. 5729 5738. PMLR, 2019. Silva, A., Killian, T., Rodriguez, I. D. J., Son, S.-H., and Gombolay, M. Optimization methods for interpretable differentiable decision trees in reinforcement learning. ar Xiv preprint ar Xiv:1903.09338, 2019. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Sutton, R. S., Barto, A. G., et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998. Sutton, R. S., Mc Allester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999. Szepesv ari, C. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1 103, 2010. Tarbouriech, J. and Lazaric, A. Active exploration in markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 974 982. PMLR, 2019. Thomas, P. S. and Brunskill, E. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. ar Xiv preprint ar Xiv:1706.06643, 2017. Weaver, L. and Tao, N. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 538 545, 2001. Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J. M., Boots, B., and Theodorou, E. A. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714 1721. IEEE, 2017. Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018. Policy Gradient with Tree Expansion Xu, P., Gao, F., and Gu, Q. An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, pp. 541 551. PMLR, 2020. Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. ar Xiv preprint ar Xiv:1905.11881, 2019. Zhang, J., Ni, C., Szepesvari, C., Wang, M., et al. On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228 2240, 2021. Policy Gradient with Tree Expansion A.1. Proof of Lemma 4.1 Bound on the policy gradient variance For any parametric policy πθ and function Q : S A R, Var ( θ log πθ(a|s)Q(s, a)) max s,a [Q(s, a)]2 max s θ log πθ( |s) 2 F , where θ log πθ( |s) RA dim(θ) is a matrix whose a-th row is θ log πθ(a|s) . Proof. The variance for a parametric policy πθ is given as follows: Var ( θ log πθ(a|s)Q(a, s)) =Es dπθ ,a πθ( |s) θ log πθ(a|s) θ log πθ(a|s)Q(s, a)2 Es dπθ ,a πθ( |s) [ θ log πθ(a|s)Q(s, a)] Es dπθ ,a πθ( |s) [ θ log πθ(a|s)Q(s, a)] , where Q(s, a) is the currently estimated Q-function and dπθ is the discounted state visitation frequency induced by the policy πθ. Since the second term we subtract is always positive (it is of quadratic form v v) we can bound the variance by the first term: Var ( θ log πθ(a|s)Q(a, s)) Es dπθ ,a πθ( |s) θ log πθ(a|s) θ log πθ(a|s)Q(s, a)2 a πθ(a|s) θ log πθ(a|s) θ log πθ(a|s)Q(s, a)2 h [Q(s, a)]2 πθ(a|s) i X a θ log πθ(a|s) θ log πθ(a|s) max s,a [Q(s, a)]2 max s a θ log πθ(a|s) θ log πθ(a|s) = max s,a [Q(s, a)]2 max s θ log πθ( |s) 2 F . A.2. Proof of Lemma 4.2 Vector form of C-Soft Tree Max In vector form, (3) is given by πC d,θ( |s) = exp h β Cs,d + Ps (P πb)d 1 Θ i 1 A exp h β Cs,d + Ps (P πb)d 1 Θ i, (8) Cs,d = γ d Rs + Ps h=1 γh d (P πb)h 1 # Proof. Consider the vector ℓs, R|A|. Its expectation satisfies Eπbℓs, (d; θ) = Eπb "d 1 X t=0 γt drt + θ(sd) t=1 γt d Ps(P πb)t 1Rπb + Ps(P πb)d 1Θ. As required. Policy Gradient with Tree Expansion A.3. Proof of Lemma 4.3 Gradient of C-Soft Tree Max The C-Soft Tree Max gradient of dimension A S is given by θ log πC d,θ = β IA 1A(πC d,θ) Ps (P πb)d 1 , where for brevity, we drop the s index in the policy above, i.e., πC d,θ πC d,θ( |s). Proof. The (j, k)-th entry of θ log πC d,θ satisifes [ θ log πC d,θ]j,k = log(πC d,θ(aj|s)) = β[Ps(P πb)d 1]j,k a h exp h β Cs,d + Ps (P πb)d 1 Θ ii a β Ps(P πb)d 1 1 A exp h β Cs,d + Ps (P πb)d 1 Θ i = β[Ps(P πb)d 1]j,k β X a πC d,θ(a|s) Ps(P πb)d 1 = β[Ps(P πb)d 1]j,k β (πC d,θ) Ps(P πb)d 1 k = β[Ps(P πb)d 1]j,k β 1A(πC d,θ) Ps(P πb)d 1 Moving back to matrix form, we obtain the stated result. A.4. Proof of Theorem 4.4 Exponential variance decay of C-Soft Tree Max The C-Soft Tree Max policy gradient is bounded by Var θ log πC d,θ(a|s)Q(s, a) 2 A2S2β2 (1 γ)2 |λ2(P πb)|2(d 1). Proof. We use Lemma 4.1 directly. First of all, it is know that when the reward is bounded in [0, 1], the maximal value of the Q-function is 1 1 γ as the sum as infinite discounted rewards. Next, we bound the Frobenius norm of the term achieved in Lemma 4.3, by applying the eigen-decomposition on P πb: P πb = 1Sµ + i=2 λiuiv i , (10) where µ is the stationary distribution of P πb, and ui and vi are left and right eigenvectors correspondingly. β IA,A 1Aπ Ps(P πb)d 1 F = β IA,A 1Aπ Ps i=2 λd 1 i uiv i (Ps is stochastic) = β IA,A 1Aπ i=2 λd 1 i Psuiv i (projection nullifies 1Aµ ) = β IA,A 1Aπ S X i=2 λd 1 i Psuiv i (triangle inequality) β i=2 IA,A 1Aπ λd 1 i Psuiv i F (matrix norm sub-multiplicativity) β|λd 1 2 | i=2 IA,A 1Aπ F Ps F uiv i F = β|λd 1 2 |(S 1) IA,A 1Aπ F Ps F . Policy Gradient with Tree Expansion Now, we can bound the norm IA,A 1Aπ F by direct calculation: IA,A 1Aπ 2 F = Tr h IA,A 1Aπ IA,A 1Aπ i (11) = Tr h IA,A 1Aπ π1 A + π π1A1 A i (12) = A 1 1 + Aπ π (13) From the Cauchy-Schwartz inequality, s [[Ps]a,s]2 = X a [Ps]a, 2 2 X a [Ps]a, 1 [Ps]a, A. Var θ log πC d,θ(a|s)Q(s, a) max s,a [Q(s, a)]2 max s θ log πC d,θ( |s) 2 F 1 (1 γ)2 β IA,A 1Aπ Ps(P πb)d 1 2 F 1 (1 γ)2 β2|λ2(P πb)|2(d 1)S2(2A2), which obtains the desired bound. A.5. A lower bound on C-Soft Tree Max gradient (result not in the paper) For completeness we also supply a lower bound on the Frobenius norm of the gradient. Note that this result does not translate to the a lower bound on the variance since we have no lower bound equivalence of Lemma 4.1. Lemma A.1. The Frobenius norm on the gradient of the policy is lower-bounded by: θ log πC d,θ( |s) F C β|λ2(P πb)|(d 1). (15) Proof. We begin by moving to the induced l2 norm by norm-equivalence: β IA,A 1Aπ Ps(P πb)d 1 F β IA,A 1Aπ Ps(P πb)d 1 2. Now, taking the vector u to be the eigenvector of the second eigenvalue of P πb: β IA,A 1Aπ Ps(P πb)d 1 2 β IA,A 1Aπ Ps(P πb)d 1u 2 = β IA,A 1Aπ Psu 2 = β|λ2(P πb)|(d 1) IA,A 1Aπ Psu 2. Note that even though Psu can be 0, that is not the common case since we can freely change πb (and therefore the eigenvectors of P πb). A.6. Proof of Lemma 4.5 Vector form of E-Soft Tree Max For d 1, (4) is given by πE d,θ( |s) = Es,d exp(βΘ) 1 AEs,d exp(βΘ), (16) D exp[βγh d R] P πb (17) with R being the |S|-dimensional vector whose s-th coordinate is r(s). Policy Gradient with Tree Expansion Proof. Recall that ℓs,a(d; θ) = γ d " t=1 γtr(st) + γdθ(sd) and, hence, exp[βℓs,a(d; θ)] = exp t=1 γtr(st) + γdθ(sd) E[exp βℓs,a(d; θ)] = E t=1 γtr(st) E [exp [β (θ(sd))]|s1, . . . , sd 1] t=1 γtr(st) P πb( |sd 1) exp(βΘ) (21) t=1 γtr(st) exp[βγ 1r(sd 1)]P πb( |sd 1) exp(βΘ). (22) By repeatedly using iterative conditioning as above, the desired result follows. Note that exp(βγ dr(s)) does not depend on the action and is therefore cancelled out with the denominator. A.7. Proof of Lemma 4.6 Gradient of E-Soft Tree Max The E-Soft Tree Max gradient of dimension A S is given by θ log πE d,θ = β IA 1A(πE d,θ) D πE d,θ 1 Es,d D(exp(βΘ)) 1 AEs,d exp(βΘ) , where for brevity, we drop the s index in the policy above, i.e., πE d,θ πE d,θ( |s). Proof. The (j, k)-th entry of θ log πE d,θ satisfies [ θ log πE d,θ]j,k = log(πE d,θ(aj|s)) θ(sk) log[(Es,d) j exp(βΘ)] log[1 AEs,d exp(βΘ)] = β(Es,d)j,k exp(βθ(sk)) (Es,d) j exp(βΘ) β1 AEs,dek exp(βθ(sk)) 1 AEs,d exp(βΘ) = β(Es,dek exp(βθ(sk)))j (Es,d) j exp(βΘ) β1 AEs,dek exp(βθ(sk)) 1 AEs,d exp(βΘ) " e j e j Es,d exp(βΘ) 1 A 1 AEs,d exp(βΘ) Es,dek exp(βθ(sk)). [ θ log πE d,θ] ,k = β h D(Es,d exp(βΘ)) 1 (1 AEs,d exp(βΘ)) 11A1 A i Es,dek exp(βθ(sk)) From this, it follows that θ log πE d,θ = β h D πE d,θ 1 1A1 A i Es,d D(exp(βΘ)) 1 AEs,d exp(βΘ) . (23) The desired result is now easy to see. Policy Gradient with Tree Expansion A.8. Proof of Theorem 4.7 Exponential variance decay of E-Soft Tree Max There exists α (0, 1) such that, for any function Q : S A R, Var θ log πE d,θ(a|s)Q(s, a) O β2α2d . If all rewards are equal (r const), then α = |λ2(P πb)|. Proof outline. Recall that thanks to Lemma 4.1, we can bound the PG variance using a direct bound on the gradient norm. The definition of the induced norm is θ log πE d,θ = max z: z =1 θ log πE d,θz , with θ log πE d,θ given in Lemma 4.6. Let z RS be an arbitrary vector such that z = 1. Then, z = PS i=1 cizi, where ci are scalar coefficients and zi are vectors spanning the S-dimensional space. In the full proof, we show our specific choice of zi and prove they are linearly independent given that choice. We do note that z1 = 1S. The first part of the proof relies on the fact that ( θ log πE d,θ)z1 = 0. This is easy to verify using Lemma 4.6 together with (6), and because h IA 1A(πE d,θ) i is a projection matrix whose null-space is spanned by 1S. Thus, θ log πE d,θz = θ log πE d,θ In the second part of the proof, we focus on Es,d from (6), which appears within θ log πE d,θ. Notice that Es,d consists of the product Qd 1 h=1 D exp(βγh d R P πb . Even though the elements in this product are not stochastic matrices, in the full proof we show how to normalize each of them to a stochastic matrix Bh. We thus obtain that Es,d = Ps D(M1) where M1 RS is some strictly positive vector. Then, we can apply a result by Mathkar & Borkar (2016), which itself builds on (Chatterjee & Seneta, 1977). The result states that the product of stochastic matrices Qd 1 h=1 Bh of our particular form converges exponentially fast to a matrix of the form 1Sµ s.t. 1Sµ Qd 1 h=1 Bh Cαd for some constant C. Lastly, 1Sµ πb gets canceled due to our choice of zi, i = 2, . . . , S. This observation along with the above fact that the remainder decays then shows that θ log πE d,θ PS i=2 zi = O(αd), which gives the desired result. Full technical proof. Let d 2. Recall that D exp[βγh d R] P πb , (24) and that R refers to the S-dimensional vector whose s-th coordinate is r(s). ( P πb if i = d 1, D 1(P πb Mi+1)P πb D(Mi+1) if i = 1, . . . , d 2, (25) and the vector ( exp(βγ 1R) if i = d 1, exp(βγi d R) P πb Mi+1 if i = 1, . . . , d 2, (26) where denotes the element-wise product. Then, Es,d = Ps D(M1) i=1 Bi. (27) Policy Gradient with Tree Expansion It is easy to see that each Bi is a row-stochastic matrix, i.e., all entries are non-negative and Bi1S = 1S. Next, we prove that all non-zeros entries of Bi are bounded away from 0 by a constant. This is necessary to apply the next result from (Chatterjee & Seneta, 1977). The j-th coordinate of Mi satisfies (Mi)j = exp[βγi d Rj] X k [P πb]j,k(Mi+1)k exp[βγi d R] Mi+1 . (28) Separately, observe that Md 1 exp(βγ 1R) . Plugging these relations in (26) gives h=1 exp[βγh d R] = h=1 exp[βγ d R] γh = exp[βγ d R] Pd 1 h=1 γh exp[βγ d R] 1 1 γ . (29) Similarly, for every 1 i d 1, we have that h=i exp[βγ d R] γh exp[βγ d R] 1 1 γ . (30) The jk-th entry of Bi = D 1(P πb Mi+1)P πb D(Mi+1) is (Bi)jk = P πb jk [Mi+1]k P|S| ℓ=1 P πb jℓ[Mi+1]ℓ P πb jk P|S| ℓ=1 P πb jℓ[Mi+1]ℓ P πb jk exp[βγ d R] 1 1 γ . (31) Hence, for non-zero P πb jk , the entries are bounded away from zero by the same. We can now proceed with applying the following result. Now, by (Chatterjee & Seneta, 1977, Theorem 5) (see also (14) in (Mathkar & Borkar, 2016)), limd Qd 1 i=1 Bi exists and is of the form 1Sµ for some probability vector µ. Furthermore, there is some α (0, 1) such that ε(d) := Qd 1 i=1 Bi 1S µ satisfies ε(d) = O(αd). (32) Pick linearly independent vectors w2, . . . , w S such that µ wi = 0 for i = 2, . . . , d. (33) Since PS i=2 αiwi is perpendicular to µ for any α2, . . . αS and because µ exp(βΘ) > 0, there exists no choice of α2, . . . , αS such that PS i=2 αiwi = exp(βΘ). Hence, if we let z1 = 1S and zi = D(exp(βΘ)) 1wi for i = 2, . . . , S, then it follows that {z1, . . . , z S} is linearly independent. In particular, it implies that {z1, . . . , z S} spans RS. Policy Gradient with Tree Expansion Now consider an arbitrary unit norm vector z := PS i=1 cizi RS s.t. z 2 = 1. Then, θ log πE d,θz = θ log πE d,θ i=2 cizi (34) = β IA 1A(πE d,θ) D πE d,θ 1 Es,d D(exp(βΘ)) 1 AEs,d exp(βΘ) i=2 cizi (35) = β IA 1A(πE d,θ) D πE d,θ 1 Es,d 1 AEs,d exp(βΘ) i=2 ciwi (36) = β IA 1A(πE d,θ) D πE d,θ 1 1Sµ + ε(d) 1 AEs,d exp(βΘ) i=2 ciwi (37) = β IA 1A(πE d,θ) D πE d,θ 1 ε(d) 1 AEs,d exp(βΘ) i=2 ciwi (38) = β IA 1A(πE d,θ) D πE d,θ 1 ε(d)D(exp(βΘ)) 1 AEs,d exp(βΘ) (z c11S), (39) where (34) follows from the fact that θ log πE d,θz1 = θ log πE d,θ1S = 0, (35) follows from Lemma 4.6, (36) holds since zi = D(exp(βΘ)) 1wi, (38) because µ is perpendicular wi for each i, while (39) follows by reusing zi = D(exp(βΘ)) 1wi relation along with the fact that z1 = 1S. From (39), it follows that θ log πE d,θz β ε(d) IA 1A(πE d,θ) D πE d,θ 1 1 AEs,d exp(βΘ) D(exp(βΘ)) z c11S (40) βαd( IA + 1A(πE d,θ) ) 1 AEs,d exp(βΘ) exp(β max s θ(s)) z c11S (41) 1 AEs,d exp(βΘ) exp(β max s θ(s)) z c11S (42) A) D 1(Es,d exp(βΘ)) exp(β max s θ(s)) z c11S (43) A) 1 mins[Es,d exp(βΘ]s exp(β max s θ(s)) z c11S (44) A) exp(β maxs θ(s)) exp(β mins θ(s)) mins |M1| z c11S (45) A) exp(β maxs θ(s)) exp(β mins θ(s)) exp(β mins r(s)) z c11S (46) A) exp(β[max s θ(s) min s θ(s) min s r(s)]) z c11S . (47) Lastly, we prove that z c11S is bounded independently of d. First, denote by c = (c1, . . . , c S) and c = (0, c2, . . . , c S) . Policy Gradient with Tree Expansion Also, denote by Z the matrix with zi as its i-th column. Now, i=2 cizi (48) = Z Z 1z (52) Z Z 1 , (53) where the last relation is due to z being a unit vector. All matrix norms here are l2-induced norms. Next, denote by W the matrix with wi in its i-th column. Recall that in (33) we only defined w2, . . . , w S. We now set w1 = exp(βΘ). Note that w1 is linearly independent of {w2, . . . , w S} because of (33) together with the fact that µ w1 > 0. We can now express the relation between Z and W by Z = D 1(exp(βΘ))W. Substituting this in (53), we have z c11S D 1(exp(βΘ))W W 1D(exp(βΘ)) (54) W W 1 D(exp(βΘ)) D 1(exp(βΘ)) . (55) It further holds that D(exp(βΘ)) max s exp (βθ(s)) max{1, exp[β max s θ(s)])}, (56) where the last relation equals 1 if θ(s) < 0 for all s. Similarly, D 1(exp(βΘ)) 1 mins exp (βθ(s)) 1 min{1, exp[β mins θ(s)])}. (57) Furthermore, by the properties of the l2-induced norm, S max 1 i S wi 1 (59) S max{exp(βΘ), max 2 i S wi 1} (60) S max{1, exp[β max s θ(s)], max 2 i S wi 1)}. (61) W 1 = 1 σmin(W) (62) ! 1 σmin(W) (63) = (σmax(W))S 1 QS i=1 σi(W) (64) | det(W)|. (65) The determinant of W is a sum of products involving its entries. To upper bound (65) independently of d, we lower bound its denominator by upper and lower bounds on the entries [W]i,1 that are independent of d, depending on their sign: min{1, exp[β min s θ(s)])} [W]i,1 max{1, exp[β max s θ(s)])}. (66) Using this, together with (53), (55), (56), (57), and (61), we showed that z c11S is upper bounded by a constant independent of d. This concludes the proof. Policy Gradient with Tree Expansion A.9. Bias Estimates Lemma A.2. For any matrix A and ˆA, h=1 ˆAh 1( ˆA A)Ak h. Proof. The proof follows from first principles: h=1 ˆAh 1( ˆA A)Ak h = h=1 ˆAh 1 ˆAAk h h=1 ˆAh 1AAk h (67) h=1 ˆAh Ak h h=1 ˆAh 1Ak h+1 (68) h=1 ˆAh Ak h h=2 ˆAh 1Ak h+1 (69) = ˆAk Ak. (70) Henceforth, will refer to , i.e. the induced infinity norm. Also, for brevity, we denote πC d,θ and ˆπC d,θ by πθ and ˆπθ, respectively. Similarly, we use dπθ and dˆπθ to denote dπC d,θ and dˆπC d,θ. As for the induced norm of the matrix P and its perturbed counterpart ˆP, which are of size S A S, we slightly abuse notation and denote P ˆP = maxs{ Ps ˆPs }, where Ps is as defined in Section 2. Definition A.3. Let ϵ be the maximal model mis-specification, i.e., max{ P ˆP , r ˆr } = ϵ. Lemma A.4. Recall the definitions of Rs, Ps, Rπb and P πb from Section 2, and respectively denote their perturbed counterparts by ˆRs, ˆPs, ˆRπb and ˆP πb. Then, for ϵ defined in Definition A.3, max{ Rs ˆRs , Ps ˆPs , Rπb ˆRπb , P πb ˆP πb } = O(ϵ). (71) Proof. The proof follows easily from the fact that the differences above are convex combinations of P ˆP and r ˆr. Lemma A.5. Let πθ be as in (5), and let ˆπθ also be defined as in (5), but with Rs, Ps, P πb replaced by their perturbed counterparts ˆRs, ˆPs, ˆP πb throughout. Then, πC d,θ ˆπC d,θ = O(βdγ dϵ). (72) Proof. To prove the desired result, we work with (5) to bound the error between Rs, Ps, P πb, Rπb and their perturbed versions. First, we apply Lemma A.2 together with Lemma A.4 to obtain that (P πb)k ( ˆP πb)k = O(kϵ). Next, denote by M the argument in the exponent in (5), i.e. M := β[Cs,d + Ps(P πb)d 1Θ]. Similarly, let ˆ M be the corresponding perturbed sum that relies on ˆP and ˆr. Combining the bounds from Lemma A.4, and using the triangle inequality, we have that ˆ M M = O(βdγ dϵ). The factor γ d appears because Cs,d includes the term γ d as shown in Lemma 4.2. Eq. (5) states that the C-Soft Tree Max policy in the true environment is πθ = exp(M)/(1 exp(M)). Similarly define ˆπθ using ˆ M for the approximate model. Then, ˆπθ = (πθ exp( ˆ M M))1 exp(M)/(1 exp( ˆ M)), Policy Gradient with Tree Expansion where denotes element-wise multiplication. Using the above relation, we have that ˆπθ πθ = πθ exp( ˆ M M)1 exp(M) 1 exp( ˆ M) 1 . Using the relation |ex 1| = O(x) as x 0, the desired result follows. Theorem A.6. Let ϵ be as in Definition A.3. Further let ˆπC d,θ being the corresponding approximate policy as given in Lemma 4.2. Then, the policy gradient bias is bounded by θ ν V πθ θ ν V ˆπθ = O 1 (1 γ)2 Sβ2dγ dϵ . (73) We first provide a proof outline for conciseness, and only after it the complete proof. Proof outline. First, we prove that max{ Rs ˆRs , Ps ˆPs , Rπb ˆRπb , P πb ˆP πb } = O(ϵ). This follows from the fact that the differences above are suitable convex combinations of either the rows of P ˆP or r ˆr. We use the above observation along with the definitions of πC d,θ and ˆπC d,θ given in (5) to show that πC d,θ ˆπC d,θ = O(βdγ dϵ). The proof for the latter builds upon two key facts: (a) (P πb)k ( ˆP πb)k Pk h=1 ˆP πb h 1 ˆP πb P πb pπb k h = O(kϵ) for any k 0, and (b) |ex 1| = O(x) as x 0. Next, we decompose the LHS of (7) to get i=1 ˆX1(s) ˆXi 1(s) Xi(s) ˆXi(s) Xi+1(s) X4(s), where X1(s) = dπC d,θ(s) R, X2(s) = ( θ log πC d,θ( |s)) RS A, X3(s) = D(πC d,θ( |s)) RA A, X4(s) = QπC d,θ(s, ) RA A, and ˆX1(s), . . . , ˆX4(s) are similarly defined with πC d,θ replaced by ˆπC d,θ. Then, we show that, for i = 1, . . . , 4, (i) Xi(s) ˆXi(s) = O(γ dϵ) and (ii) max{ Xi , ˆXi } is bounded by problem parameters. From this, the desired result follows. Proof. We have ν V π θ (74) = Es dπθ ,a πθ( |s) [ θ log πθ(a|s)Qπθ(s, a)] Es dˆπθ ,a ˆπθ( |s) θ log ˆπθ(a|s)Qˆπθ(s, a) (75) dπθ(s)πθ(a|s) θ log πθ(a|s)Qπθ(s, a) dˆπθ(s)ˆπθ(a|s) θ log ˆπθ(a|s)Qˆπθ(s, a) (76) dπθ(s)( θ log πθ( |s)) D(πθ( |s))Qπθ(s, ) (77) dˆπθ(s)( θ log ˆπθ( |s)) D(ˆπθ( |s))Qˆπθ(s, ) (78) i=1 ˆX1(s) ˆXi 1(s) Xi(s) ˆXi(s) Xi+1(s) X4(s), (80) where X1(s) = dπθ(s) R, X2(s) = ( θ log πθ( |s)) RS A, X3(s) = D(πθ( |s)) RA A, X4(s) = Qπθ(s, ) RA A, and ˆX1(s), . . . , ˆX4(s) are similarly defined with πθ replaced by ˆπθ. Therefore, θ ν V πθ ν V π θ max s Γ(s) S, (81) i=1 ˆX1(s) ˆXi 1(s) Xi(s) ˆXi(s) Xi+1(s) X4(s) . (82) Policy Gradient with Tree Expansion Next, since dπθ, dˆπθ, πθ, and ˆπθ are all distributions, we have max{|X1(s)|, | ˆ X1(s)|, |X3(s, a)|, | ˆ X3(s, a)|} 1. (83) Separately, using Lemma 4.3, we have X2 = θ log πθ(a|s) β( IA + 1Aπ θ ) Ps (P πb)d 1 . (84) Since all rows of the above matrices have non-negative entries that add up to 1, we get In the rest of the proof, we bound each of X1 ˆ X1 , . . . , X4 ˆ X4 . X4 1 1 γ . (86) Similarly, the same bounds hold for ˆ X1, ˆ X2, ˆ X3 and ˆ X4. From, we have X1 ˆ X1 (1 γ) t=0 γt ν (P πθ)t ν (P ˆπθ)t (87) t=0 γttdϵ (88) t=0 γtt (89) The last relation follows from the fact that (1 γ) 1 = P t=0 γt, which in turn implies t=0 tγt. (91) From Lemma A.5, it follows that X3 ˆ X3 = O(βdϵ). (92) Next, recall that from Lemma 4.3 that X2(s, ) = β IA 1A(πθ) Ps (P πb)d 1 . X2(s, ) ˆ X2(s, ) β IA 1A(πθ) Ps (P πb)d 1 ˆP πb d 1 (93) + β IA 1A(πθ) Ps ˆPs ˆP πb d 1 (94) + β 1A(πθ) 1A(ˆπθ) ˆPs ˆP πb d 1 . (95) Following the same argument as in (85) and applying Lemma A.2, we have that (93) is O(βdϵ). Similarly, from the argument of (85), Eq. (94) is O(βϵ). Lastly, (95) is O(βdϵ) due to Lemma A.5. Putting the above three terms together, we have that X2(s, ) ˆ X2(s, ) = O(βdϵ). (96) Policy Gradient with Tree Expansion Since the state-action value function satisfies the Bellman equation, we have Qπθ = r + γPQπθ (97) and Qˆπθ = ˆr + γ ˆPQˆπθ. (98) Consequently, Qπθ Qˆπθ r ˆr + γ PQπθ PQˆπθ + γ PQˆπθ ˆPQˆπθ (99) ϵ + γ P Qπθ Qˆπθ + γ P ˆP Qˆπθ (100) ϵ + γ Qπθ Qˆπθ + γ 1 γ ϵ, (101) which finally shows that X4 ˆ X4 = Qπθ Qˆπθ ϵ (1 γ)2 . (102) B. Experiments B.1. Implementation Details The environment engine is the highly efficient Atari-Cu LE (Dalton et al., 2020), a CUDA-based version of Atari that runs on GPU. Similarly, we use Atari-Cu LE for the GPU-based breadth-first TS as done in (Dalal et al., 2021). We train Soft Tree Max for depths d = 1 . . . 8, with a single worker. We use five seeds for each experiment. For the implementation, we extend Stable-Baselines3 (Raffin et al., 2019) with all parameters taken as default from the original PPO paper (Schulman et al., 2017). For depths d 3, we limited the tree to a maximum width of 1024 nodes and pruned non-promising trajectories in terms of estimated weights. Since the distributed PPO baseline advances significantly faster in terms of environment steps, for a fair comparison, we ran all experiments for one week on the same machine and use the wall-clock time as the x-axis. We use Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz equipped with one NVIDIA Tesla V100 32GB. B.2. GPU-Based Tree Expansion Implementation Figure 4 illustrates the GPU-based tree expansion mechanism used in our implementation. We achieve efficient parallelization by duplicating and concatenating all states in the current level of the tree with each possible action, then advancing them simultaneously with a single forward pass through the simulator. The implementation follows these steps: 1. Start with a root state s0 and expand it across all actions a A. 2. For each depth level t from 1 to d: Collect all states st from the previous level. Duplicate each state for each action to create a batch of state-action pairs. Submit the entire batch to the GPU simulator in a single forward pass. Collect the resulting next states and corresponding rewards. If t < d and pruning is enabled, select the top-k most promising branches based on accumulated rewards. 3. For the leaf states (at depth d), compute the neural network outputs Wθ(sd). 4. Combine the accumulated rewards along each trajectory with the corresponding leaf state values. 5. Compute the final policy logits according to Equation (2) and apply the softmax operation. This parallel implementation allows us to efficiently explore a larger number of trajectories compared to sequential tree expansion, making deeper tree depths practically feasible. Policy Gradient with Tree Expansion Figure 4. A diagram of the tree expansion used by Soft Tree Max. In every step, the states in the current level of the tree are duplicated and concatenated with each possible action. The resulting state-action pairs are then fed as a batch to the GPU simulator to generate the next level of states. Finally, the states of the last level d are inserted into the neural network Wθ and the logits are computed using the corresponding rewards along each trajectory. Algorithm 1 C-Soft Tree Max Input: GPU environment G, network θ, depth d Init tensors: state S = [s], action A0 = [0, 1, 2, .., A 1], reward R = [0] for id = 0 to d 1 do S S A, R R A {Replicate state and reward tensors A times} r, S = G([ S, A]) {Feed [ S, A] to simulator and advance} R R + γid r, S S {Accumulate discounted reward } A A A {Replicate action tensor A times} end for ls,a Averageπb A0=a( R + γdθ( S))/γd{Weighted average induced by πb} Return π(a|s0) exp [βls,a(d; θ)] {Return optimal action at the root} B.3. Algorithms This section provides the pseudocode for our Soft Tree Max implementation. Algorithm 1 details the C-Soft Tree Max policy computation, which efficiently utilizes GPU parallelization to perform tree expansion. Algorithm 2 shows how Soft Tree Max integrates with the PPO algorithm, distinguishing the usage of our new policy in red. Note that in Algorithm 2, we use Generalized Advantage Estimation (GAE) with λ = 0.95 for calculating advantage estimates, which is the standard configuration in the stable-baselines3 PPO implementation that we build upon. B.4. Time-Based Training Curves We provide the training curves in Figure 5. For brevity, we exclude a few of the depths from the plots. As seen, there is a clear benefit for Soft Tree Max over distributed PPO with the standard softmax policy. In most games, PPO with the Soft Tree Max policy shows very high sample efficiency: it achieves higher episodic reward although it observes much less episodes, for the same running time. Policy Gradient with Tree Expansion Algorithm 2 Soft Tree Max-PPO 1: Initialize policy parameters θ0 2: Initialize value function parameters ϕ 3: for k = 0, 1, 2, . . . do 4: Collect set of trajectories Dk = {τi} by running policy πd,θk from Algorithm 1 5: Compute rewards-to-go ˆRt 6: Compute advantage estimates ˆAt using GAE with λ = 0.95 7: for each epoch do 8: for each minibatch do 9: Compute policy ratio rt(θ) = πd,θ(at|st) πd,θk (at|st) 10: Compute clipped surrogate objective: 11: LCLIP (θ) = Et[min(rt(θ) ˆAt, clip(rt(θ), 1 ϵ, 1 + ϵ) ˆAt)] 12: Update θ with gradient step on LCLIP (θ) 13: Compute value function loss: LV F (ϕ) = (Vϕ(st) ˆRt)2 14: Update ϕ with gradient step on LV F (ϕ) 15: end for 16: end for 17: end for Figure 5. Training curves: GPU Soft Tree Max (single worker) vs PPO (256 GPU workers). The plots show average reward and standard deviation over 5 seeds. The x-axis is the wall-clock time. The runs ended after one week with varying number of time-steps. The training curves correspond to the evaluation runs in Figure 3. B.5. Step-Based Training Curves In Figure 6 we also provide the same convergence plots where the x-axis is now the number of online interactions with the environment, thus excluding the tree expansion complexity. As seen, due to the complexity of the tree expansion, less steps are conducted during training (limited to one week) as the depth increases. In this plot, the monotone improvement of the reward with increasing tree depth is noticeable in most games. Policy Gradient with Tree Expansion Figure 6. Training curves: GPU Soft Tree Max (single worker) vs PPO (256 GPU workers). The plots show average reward and standard deviation over 5 seeds. The x-axis is the number of online interactions with the environment. The runs ended after one week with varying number of time-steps. The training curves correspond to the evaluation runs in Figure 3. We note that not for all games we see monotonicity. Our explanation for this phenomenon relates to how immediate reward contributes to performance compared to the value. Different games benefit differently from long-term as opposed to short-term planning. Games that require longer-term planning need a better value estimate. A good value estimate takes longer to obtain with larger depths, in which we apply the network to states that are very different from the ones observed so far in the buffer (recall that as in any deep RL algorithm, we train the model only on states in the buffer). If the model hasn t learned a good enough value function yet, and there is no guiding dense reward along the trajectory, the policy becomes noisier, and can take more steps to converge even more than those we run in our week-long experiment. For a concrete example, let us compare Breakout to Gopher. Inspecting Fig. 6, we observe that Breakout quickly (and monotonically) gains from large depths since it relies on the short term goal of simply keeping the paddle below the moving ball. In Gopher, however, for large depths ( =5), learning barely started even by the end of the training run. Presumably, this is because the task in Gopher involves multiple considerations and steps: the agent needs to move to the right spot and then hit the mallet the right amount of times, while balancing different locations. This task requires long-term planning and thus depends more strongly on the accuracy of the value function estimate. In that case, for depth 5 or more, we would require more train steps for the value to kick in and become beneficial beyond the gain from the reward in the tree. The figures above convey two key observations that occur for at least some non-zero depth: (1) The final performance with the tree is better than PPO (Fig. 3); and (2) the intermediate step-based results with the tree are better than PPO (Fig. 6). This leads to our main takeaway from this work there is no reason to believe that the vanilla policy gradient algorithm should be better than a multi-step variant. Indeed, we show that this is not the case. C. Further discussion C.1. The case of λ2(P πb) = 0 When P πb is rank one, it is not only its variance that becomes 0, but also the norm of the gradient itself (similarly to the case of d ). Note that such a situation will happen rarely, in degenerate MDPs. This is a local minimum for Soft Tree Max and it would cause the PG iteration to get stuck, and to the optimum in the (desired but impractical) case where πb is the optimal policy. However, a similar phenomenon was also discovered in the standard softmax with deterministic policies: θ(s, a) for one a per s. PG with softmax would suffer very slow convergence near these local equilibria, as observed in (Mei et al., 2020a). To see this, note that the softmax gradient is θ log πθ(a|s) = ea πθ( |s), where ea [0, 1]A is the vector with 0 everywhere except for the a-th coordinate. I.e., it will be zero for a deterministic policy. Soft Tree Max avoids these local optima by integrating the reward into the policy itself (but may get stuck in another, as discussed above).