# Model Tensor Planning

Published in Transactions on Machine Learning Research (08/2025)

An T. Le1,7, Khai Nguyen7, Minh Nhat Vu5,6, João Carvalho1, Jan Peters1,2,3,4 — {an, joao, jan}@robot-learning.de

1 Intelligent Autonomous Systems, Department of Computer Science, Technical University of Darmstadt, Germany
2 Systems AI for Robot Learning, German Research Center for AI (DFKI), Germany
3 Hessian Center for Artificial Intelligence (hessian.AI), Germany
4 Centre for Cognitive Science, Technical University of Darmstadt, Germany
5 Automation & Control Institute, TU Wien, Austria
6 Austrian Institute of Technology (AIT), Vienna, Austria
7 VinRobotics and VinUniversity, Vietnam

Reviewed on OpenReview: https://openreview.net/forum?id=fk1ZZdXCE3

Abstract

Sampling-based model predictive control (MPC) offers strong performance in nonlinear and contact-rich robotic tasks, yet often suffers from poor exploration due to locally greedy sampling schemes. We propose Model Tensor Planning (MTP), a novel sampling-based MPC framework that introduces high-entropy control trajectory generation through structured tensor sampling. By sampling over randomized multipartite graphs and interpolating control trajectories with B-splines and Akima splines, MTP ensures smooth and globally diverse control candidates. We further propose a simple β-mixing strategy that blends local exploitative and global exploratory samples within the modified Cross-Entropy Method (CEM) update, balancing control refinement and exploration. Theoretically, we show that MTP achieves asymptotic path coverage and maximum entropy in the control trajectory space in the limit of infinite tensor depth and width. Our implementation is fully vectorized using JAX and compatible with MuJoCo XLA, supporting Just-in-time (JIT) compilation and batched rollouts for real-time control with online domain randomization.
Through experiments on various challenging robotic tasks, ranging from dexterous in-hand manipulation to humanoid locomotion, we demonstrate that MTP outperforms standard MPC and evolutionary strategy baselines in task success and control robustness. Design and sensitivity ablations confirm the effectiveness of MTP's tensor sampling structure, spline interpolation choices, and mixing strategy. Altogether, MTP offers a scalable framework for robust exploration in model-based planning and control.

1 Introduction

Sampling-based Model Predictive Control (MPC) (Mayne, 2014; Lorenzen et al., 2017; Williams et al., 2017) has emerged as a powerful framework for controlling nonlinear and contact-rich systems. Unlike gradient-based or linearization approaches, sampling-based MPC is model-agnostic and does not require differentiable dynamics, making it well-suited for high-dimensional, complex systems such as legged robots (Alvarez-Padilla et al., 2024) and dexterous manipulators (Li et al., 2024). Moreover, its inherent parallelism enables efficient deployment on modern hardware (e.g., GPUs), allowing for high-throughput simulation and online domain randomization in real-time control pipelines (Pezzato et al., 2025).

Despite these advantages, a fundamental limitation remains: sampling-based MPC is typically local in its search behavior. Most methods perturb a nominal trajectory or refine the current best samples, which makes them susceptible to local minima and unable to consistently discover globally optimal solutions (Xue et al., 2024). While the Cross-Entropy Method (CEM) (De Boer et al., 2005) has shown promise in high-dimensional control and sparse-reward settings (Pinneri et al., 2021), it still suffers from mode collapse when sampling locally (Zhang et al., 2022), leading to suboptimal behaviors.
The curse of dimensionality exacerbates this issue, as the number of samples required to explore the control space grows exponentially with the planning horizon and control dimension, posing a bottleneck when compute or memory is limited. These challenges motivate the need for a more effective, high-entropy sampling mechanism for control generation. Evolutionary Strategies (ES) (Hansen, 2016; Wierstra et al., 2014; Salimans et al., 2017) have also been applied in sampling-based MPC settings to improve sampling exploration. While they improve over purely local strategies in some tasks, our experiments (cf. Section 4.1) reveal that ES still fails to systematically explore multimodal control landscapes, often yielding inconsistent performance on tasks requiring long-term coordinated actions.

In this work, we introduce Model Tensor Planning (MTP), a novel sampling-based MPC framework that enables globally exploratory control generation through structured tensor sampling. MTP reformulates control sampling as tensor operations over randomized multipartite graphs, enabling efficient generation of diverse control sequences with high entropy. To balance exploration and exploitation, we propose a simple yet effective β-mixing mechanism that combines globally exploratory samples with locally exploitative refinements. We also provide a theoretical analysis under bounded-variation assumptions, showing that our sampling scheme achieves asymptotic path coverage, approximating maximum entropy in trajectory space. MTP is designed for real-time applicability with a matrix-based formulation, which is compatible with Just-in-time (JIT) compilation and vectorized mapping (e.g., via JAX vmap (Bradbury et al., 2018)), allowing high-throughput sampling, batch rollout evaluation, and online domain randomization on modern simulators.
Our main contributions are as follows:
- We propose tensor sampling, a novel structured sampling strategy for control generation, and provide theoretical justification via asymptotic path coverage.
- We introduce a simple β-mixing mechanism that effectively balances exploration and exploitation by blending high-entropy and local samples within the modified CEM update rule.
- We demonstrate that MTP is highly compatible with modern vectorized simulators, enabling efficient batch rollout evaluation and robust real-time control in high-dimensional, contact-rich environments.

2 Preliminary

Notations and Assumptions. We consider the problem of sampling-based MPC. Given a dynamics model $\dot{x} = f(x, u)$, we consider the path sampling problem in the control space $\mathcal{U} \subset \mathbb{R}^n$, with the control $u \in \mathcal{U}$ being $n$-dimensional, at the current system state $x \in \mathcal{X} \subset \mathbb{R}^d$. Typically, a batch of control trajectories is sampled, rolled out through the dynamics model, and evaluated using a cost function. Let a control path be $u : [0, 1] \to \mathcal{U}$, $u(t) \in \mathcal{U}$; we can define the path arc length as

$$\mathrm{TV}(u) = \sup_{M \in \mathbb{N}^+,\; 0 = t_1 \le \ldots \le t_M = 1} \; \sum_{i=1}^{M-1} \lVert u(t_{i+1}) - u(t_i) \rVert. \quad (1)$$

We define $\mathcal{F}$ as the set of all control paths that are uniformly continuous with bounded variation, $\mathrm{TV}(u) < \infty,\ \forall u \in \mathcal{F}$. This assumption is common in many control settings, where the control trajectories are bounded in time and control space (i.e., both time and control spaces are compact). Throughout this paper, we present the preliminaries and the tensor sampling method in matrix definitions, discretizing continuous paths with equal time intervals.

2.1 Cross-Entropy Method for Sampling-based MPC

Consider a discretized dynamical system $x_{t+1} = f(x_t, u_t)$, $t = 0, \ldots, T-1$ with horizon $T$, where $x_t \in \mathbb{R}^d$, $u_t \in \mathbb{R}^n$ denote the state and control at time step $t$. Given the stage cost function $c(x, u)$ and the terminal cost $c_T(x)$, the objective is to minimize the cumulative cost

$$J(\tau, U) = \sum_{t=0}^{T-1} c(x_t, u_t) + c_T(x_T), \quad \text{s.t.}$$
$$x_{t+1} = f(x_t, u_t), \quad (2)$$

where the state trajectory $\tau = [x_0, \ldots, x_T] \in \mathbb{R}^{(T+1) \times d}$ and control trajectory $U = [u_0, \ldots, u_{T-1}] \in \mathbb{R}^{T \times n}$ are the dynamics rollout, $c(x_t, u_t)$ is the immediate cost at each time step, and $c_T(x_T)$ is the terminal cost.

Figure 1: Comparison of MTP interpolation methods (Linear, B-spline, Akima-spline) versus CEM on the PushT environment. The cost of PushT is the sum of the position and orientation error to the green target (without the guiding contact cost), and the initial T pose is randomized. The first row depicts the cost convergence over 10 seeds. In most seeds, CEM struggles to push the object due to the lack of exploration (e.g., mode collapse), while MTP variants always find the correct contact point to achieve the task. Note that the control magnitude of MTP is high due to the globally explorative samples (see second & third rows), compared to the white noise samples (blue). B-spline helps regulate the control magnitude due to its barycentric weightings, while retaining exploration behaviors. The last row illustrates the control trajectories between 64 tensor samples and 64 white noise trajectories. Experiment videos are publicly available at https://sites.google.com/view/tensor-sampling/.

CEM optimizes the control sequence $U$ iteratively by approximating the optimal control distribution with a parametric probability distribution, typically a Gaussian. Initially, the control inputs are sampled from an initial distribution parameterized by mean $\mu \in \mathbb{R}^{T \times n}$ and standard deviation $\sigma \in \mathbb{R}^{T \times n}$. At each iteration, CEM performs the following steps:

Sampling. Draw $B$ candidate control sequences from the current Gaussian distribution:

$$U^{(k)} \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2)), \quad k = 1, \ldots, B.$$
(3)

Evaluation. Roll out $\tau^{(k)}$ from the dynamics model and compute the cost $J(\tau^{(k)}, U^{(k)})$ for each sampled control sequence by simulating the system dynamics.

Elite Selection. Choose the top $E \le B$ elite candidate control sequences with the lowest cost, forming an elite set $\mathcal{E} = \{U^{(k)}\}_{k=1}^{E}$.

Distribution Update. Update the parameters $(\mu, \sigma)$ based on the elite samples:

$$\mu_{\text{new}} = \frac{1}{E} \sum_{U \in \mathcal{E}} U, \qquad \sigma^2_{\text{new}} = \frac{1}{E-1} \sum_{U \in \mathcal{E}} (U - \mu_{\text{new}})^2. \quad (4)$$

An exponential smoothing update rule can be used for stability:

$$\mu \leftarrow \alpha \mu + (1 - \alpha)\, \mu_{\text{new}}, \qquad \sigma \leftarrow \alpha \sigma + (1 - \alpha)\, \sigma_{\text{new}}, \quad (5)$$

where $\alpha \in [0, 1)$ is a smoothing factor.

Termination Criterion. The iterative process continues until a convergence criterion is met or a maximum number of iterations is reached. The optimal control sequence is approximated by the final mean $\mu$ of the Gaussian distribution.

2.2 Spline-based Controls

Splines provide a powerful representation for trajectory generation in MPC due to their flexibility, continuity properties, and ease of parameterization (Jankowski et al., 2023; Carvalho et al., 2025). In this work, we focus on spline parameterization of control trajectories. Spline-based trajectories ensure smooth and feasible control inputs that satisfy constraints and objectives inherent to MPC frameworks. A spline is defined as a piecewise polynomial function $u(t) : [0, T] \to \mathcal{U}$, which is polynomial within intervals divided by knots $t_1, \ldots, t_M$, with continuity conditions enforced at these knots. In particular:

Knots. A knot $t_i \in [0, T]$ is a time point where polynomial pieces join. We have a non-decreasing sequence of knots $0 = t_1 \le \ldots \le t_M = T$, which partitions the time interval $[0, T]$ into pieces $[t_i, t_{i+1}]$ so that the path is polynomial in each piece. We may often have double or triple knots, meaning that several consecutive knots $t_i = t_{i+1}$ are equal, especially at the beginning and end, as this can ensure boundary conditions for zero higher-order derivatives.

Waypoints.
A waypoint $u(t) \in \mathcal{U}$ is a point on the path, typically corresponding to $u(t_i)$.

Control Points. A set of control points $Z = \{z_i \mid z_i \in \mathbb{R}^n\}_{i=1}^K$ parametrizes the spline via basis functions.

B-Spline Parameterization. In B-splines (de Boor, 1973), the path $u$ is expressed as a linear combination of control points $z_i \in Z$:

$$u(t) = \sum_{i=1}^{K} B_{i,p}(t)\, z_i, \quad \text{s.t.} \quad \sum_{i=1}^{K} B_{i,p}(t) = 1, \quad (6)$$

where $B_{i,p} : \mathbb{R} \to \mathbb{R}$ maps the time $t$ to the weighting of the $i$th control point, depicting the $i$th control point weight for blending (i.e., as with a probability distribution over $i$). Hence, the control waypoint $u(t)$ is always in the convex hull of the control points. The B-spline functions $B_{i,p}(t)$ are fully specified by a non-decreasing series of time knots $0 = t_1 \le \ldots \le t_M = T$ and the integer polynomial degree $p \in \{0, 1, \ldots\}$ by

$$B_{i,0}(t) = \mathbb{1}[t_i \le t < t_{i+1}], \quad \text{for } 1 \le i \le M-1,$$
$$B_{i,p}(t) = \frac{t - t_i}{t_{i+p} - t_i}\, B_{i,p-1}(t) + \frac{t_{i+p+1} - t}{t_{i+p+1} - t_{i+1}}\, B_{i+1,p-1}(t), \quad \text{for } 1 \le i \le M-p-1. \quad (7)$$

$B_{i,0}$ are binary indicators of $t \in [t_i, t_{i+1}]$ with $1 \le i \le M-1$. The 1st-degree B-spline functions $B_{i,1}$ have support in $t \in [t_i, t_{i+2}]$ with $1 \le i \le M-2$, such that $\sum_{i=1}^{M-2} B_{i,1}(t) = 1$ holds. In general, degree-$p$ B-spline functions $B_{i,p}$ have support in $t \in [t_i, t_{i+p+1}]$ with $1 \le i \le M-p-1$. We need $K = M - p - 1$ control points $z_{1:K}$, which ensures the normalization property $\sum_{i=1}^{K} B_{i,p}(t) = 1$ for every degree.

Akima-Spline Parameterization. The Akima spline (Akima, 1974) is a piecewise cubic interpolation method that exhibits $C^1$ smoothness by using only local points to construct the spline, avoiding the oscillation and overshooting of other interpolation methods, such as B-splines. In other words, an Akima spline is a piecewise cubic spline constructed to pass through the control points with $C^1$ smoothness. Given the control point set $Z$ with $K = M$, the Akima spline constructs a piecewise cubic polynomial $u(t)$ for each interval $[t_i, t_{i+1}]$:

$$u_i(t) = d_i (t - t_i)^3 + c_i (t - t_i)^2 + b_i (t - t_i) + a_i, \quad (8)$$

where the coefficients $a_i, b_i, c_i, d_i \in \mathcal{U}$ are determined from the conditions of smoothness and interpolation.
Let $m_i = (z_{i+1} - z_i)/(t_{i+1} - t_i)$; the spline slope is computed as

$$s_i = \frac{|m_{i+1} - m_i|\, m_{i-1} + |m_{i-1} - m_{i-2}|\, m_i}{|m_{i+1} - m_i| + |m_{i-1} - m_{i-2}|}. \quad (9)$$

The spline slopes for the two points at each end are $s_1 = m_1$, $s_2 = (m_1 + m_2)/2$, $s_{M-1} = (m_{M-1} + m_{M-2})/2$, $s_M = m_{M-1}$. Then, the polynomial coefficients are uniquely defined as

$$a_i = u_i, \quad b_i = s_i, \quad c_i = \frac{3 m_i - 2 s_i - s_{i+1}}{t_{i+1} - t_i}, \quad d_i = \frac{s_i + s_{i+1} - 2 m_i}{(t_{i+1} - t_i)^2}. \quad (10)$$

Figure 2: Illustration of different tensor interpolations (Linear, B-spline $p=2$, B-spline $p=3$, Akima-spline) on an evenly spaced graph with $M = 3$, $N = 9$. With a higher B-spline degree, the control trajectories exhibit more smoothness and conservative behavior, while the Akima-spline aggressively and smoothly tracks control-waypoints. Note that we do not consider boundary conditions for B-spline interpolation in tensor sampling.

Motivation. Spline representations provide several benefits. (i) They ensure smooth control trajectories (Watson & Peters, 2023) with guaranteed continuity of positions, velocities, and accelerations under mild assumptions on the dynamics. (ii) Splines simplify complex trajectories through a few control points and efficiently incorporate constraints and boundary conditions, enabling efficient learning and planning by dimension reduction (Carvalho et al., 2025). (iii) They enable easy numerical optimization thanks to differentiable and convex representations.

3 Model Tensor Planning

We first propose tensor sampling, a batch control path sampler with highly explorative behavior, and investigate its path coverage property over the compact control space. Then, we incorporate tensor sampling into the modified CEM update rule, balancing exploration and exploitation in cost evaluation, forming an overall vectorized sampling-based MPC algorithm.

3.1 Tensor Sampling

Inspired by Le et al. (2025), we utilize the random multipartite graph as a tensor discretization structure to approximate global path sampling.
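As a concrete reference for Eqs. (8)–(10), the Akima construction can be sketched as follows. This is a NumPy stand-in for the paper's batched JAX implementation; the helper names `akima_slopes` and `akima_eval` are illustrative, and the zero-denominator fallback to the average slope is a standard Akima convention not spelled out in the text.

```python
import numpy as np

def akima_slopes(z, t):
    """Akima slopes s_i (Eq. 9) with the boundary rules from the text; z: (M, n), t: (M,)."""
    M = len(t)
    m = (z[1:] - z[:-1]) / (t[1:] - t[:-1])[:, None]   # (M-1, n) segment slopes m_i
    s = np.empty_like(z)
    s[0], s[1] = m[0], 0.5 * (m[0] + m[1])             # s_1 = m_1, s_2 = (m_1 + m_2)/2
    s[-1], s[-2] = m[-1], 0.5 * (m[-1] + m[-2])        # symmetric rules at the far end
    for i in range(2, M - 2):
        w1 = np.abs(m[i + 1] - m[i])                   # |m_{i+1} - m_i|
        w2 = np.abs(m[i - 1] - m[i - 2])               # |m_{i-1} - m_{i-2}|
        denom = w1 + w2
        safe = np.where(denom > 1e-12, denom, 1.0)
        # Eq. (9), falling back to the average slope when both weights vanish
        s[i] = np.where(denom > 1e-12, (w1 * m[i - 1] + w2 * m[i]) / safe,
                        0.5 * (m[i - 1] + m[i]))
    return s

def akima_eval(z, t, tq):
    """Evaluate the piecewise cubic of Eq. (8) with coefficients from Eq. (10)."""
    s = akima_slopes(z, t)
    out = np.empty((len(tq), z.shape[1]))
    for k, tk in enumerate(tq):
        i = min(int(np.searchsorted(t, tk, side="right")) - 1, len(t) - 2)
        h = t[i + 1] - t[i]
        mi = (z[i + 1] - z[i]) / h
        a, b = z[i], s[i]                              # a_i = u_i, b_i = s_i
        c = (3 * mi - 2 * s[i] - s[i + 1]) / h         # c_i of Eq. (10)
        d = (s[i] + s[i + 1] - 2 * mi) / h ** 2        # d_i of Eq. (10)
        dt = tk - t[i]
        out[k] = a + b * dt + c * dt ** 2 + d * dt ** 3
    return out

t = np.linspace(0.0, 1.0, 6)                           # M = 6 uniform knots
z = np.column_stack([2.0 * t + 1.0, -t])               # a linear path in n = 2 dims
u = akima_eval(z, t, np.linspace(0.0, 1.0, 25))        # linear data is reproduced exactly
```

Because the coefficients satisfy $u_i(t_{i+1}) = z_{i+1}$ and $u_i'(t_{i+1}) = s_{i+1}$, the spline passes through every control point with matching first derivatives, which is the $C^1$ interpolation property claimed above.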
Definition 1 (Random Multipartite Graph Control Discretization). Consider a graph $G(M, N) = (V, W)$ on the control space $\mathcal{U}$, where the node set $V = \{L_i\}_{i=1}^M$ is a set of $M$ layers. Each layer $L_i = \{u_j \in \mathcal{U} \mid u_j \sim \mathrm{Uniform}(\mathcal{U})\}_{j=1}^N$ contains $N$ control-waypoints sampled from a uniform distribution over the control space (i.e., bounded by the control limits). The edge set $W$ is defined by the union of (forward) pair-wise connections between layers, $W = \{(u_i, u_{i+1}) \mid u_i \in L_i,\, u_{i+1} \in L_{i+1},\, 1 \le i < M\}$, leading to a complete $M$-partite directed graph.

Sampling from $G(M, N)$. The graph nodes are represented as the control-waypoint tensor $Z \in \mathbb{R}^{M \times N \times n}$ for all layers, within the control limits. To sample a batch of $B \in \mathbb{N}^+$ control paths with a horizon $T$, we subsample with replacement $C \in \mathbb{R}^{B \times M \times n}$ from the set of all combinatorial paths in $G$ (cf. Algorithm 1) and further interpolate $C$ into control trajectories $U \in \mathbb{R}^{B \times T \times n}$ with different smooth structures, e.g., using Eq. (6) or Eq. (8). Sampling with replacement is cheap, $O(MN)$, while sampling without replacement is $O(N^M)$. Sampling without replacement adds overhead due to re-indexing or tree traversal to obtain batches of sequence indices (depending on the low-level implementation of sampling) but offers better diversity. In practice, we use sampling with replacement, which does not noticeably affect diversity (see last row in Fig. 1), is faster, and scales well with JAX vectorized operations. To see this, note that there are $N^M$ combinatorial paths in $G(M, N)$, each path has uniform mass $1/N^M$, and the probability of sampling the same path twice is small.

Control Path Interpolation. Straight-line interpolation can be realized straightforwardly by probing $H = T/M$ equidistant points between layers, forming the linear coverage trajectories $U \in \mathbb{R}^{B \times T \times n}$. However, there exist discontinuities at control waypoints under linear interpolation.
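Sampling with replacement from $G(M, N)$ reduces to drawing a $(B, M)$ index tensor and gathering, followed by interpolation between layers. A minimal NumPy sketch (the paper's implementation is vectorized in JAX; `sample_paths` and `linear_interpolate` are illustrative names, and 0-based indices replace the 1-based indices of Algorithm 1):

```python
import numpy as np

def sample_paths(Z, B, rng):
    """Sample B control-waypoint sequences with replacement from G(M, N).
    Z: (M, N, n) control-waypoint tensor; returns C: (B, M, n)."""
    M, N, _ = Z.shape
    I = rng.integers(0, N, size=(B, M))                # batch index sample, O(MN) memory
    return Z[np.arange(M)[None, :], I]                 # gather: C[b, i] = Z[i, I[b, i]]

def linear_interpolate(C, T):
    """Piecewise-linear interpolation of waypoints C (B, M, n) onto T equidistant steps."""
    B, M, n = C.shape
    t = np.linspace(0.0, 1.0, T)
    knots = np.linspace(0.0, 1.0, M)                   # layer time slices
    return np.stack([np.stack([np.interp(t, knots, C[b, :, d]) for d in range(n)],
                              axis=-1) for b in range(B)])

rng = np.random.default_rng(0)
Z = rng.uniform(-1.0, 1.0, size=(4, 8, 2))             # M = 4 layers, N = 8 waypoints, n = 2
C = sample_paths(Z, B=16, rng=rng)                     # (16, 4, 2) waypoint sequences
U = linear_interpolate(C, T=20)                        # (16, 20, 2) control trajectories
```

Because each trajectory is a convex combination of uniformly sampled waypoints, the batch stays inside the control limits while covering the control space far more aggressively than Gaussian perturbations of a single nominal trajectory.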
Hence, we motivate spline tensor interpolations for sampling smooth control paths (cf. Fig. 2).

Definition 2 (B-spline Control Trajectories). Given two time sequences $t_i = i/(M+p+1)$, $i \in \{0, \ldots, M+p\}$, and $t_j = j/T$, $j \in \{0, \ldots, T-1\}$, the B-spline matrix $B_p \in \mathbb{R}^{M \times T}$ can be constructed recursively following Eq. (7), with indices $i, j$ corresponding to the element $B_{i,p}(t_j)$. Then, the control trajectories can be interpolated by performing an einsum over the $M$ dimension of $B_p \in \mathbb{R}^{M \times T}$ and $C \in \mathbb{R}^{B \times M \times n}$, resulting in $U \in \mathbb{R}^{B \times T \times n}$.

B-spline control trajectories exhibit conservative behavior, as they lie strictly inside the convex hull of the control points. Alternatively, they can be forced to pass through all control points by adding a multiplicity of $p$ per knot (de Boor, 1973). However, this method wastes computation by increasing the B-spline matrix size to $Mp \times T$, and still cannot avoid the overshooting problem. Thus, we further propose the Akima-spline control interpolation.

Definition 3 (Akima-spline Control Trajectories (Le et al., 2025)). Given the time sequence $t_i = i/M$, $i \in \{0, \ldots, M-1\}$, representing the $M$ layer time slices, and the control points $C \in \mathbb{R}^{B \times M \times n}$, the Akima polynomial parameters $A \in \mathbb{R}^{B \times (M-1) \times 4 \times n}$ can be computed following Eq. (10) in batch. Then, given the time sequence $t_j = j/T$, $j \in \{0, \ldots, T-1\}$, the control trajectories $U \in \mathbb{R}^{B \times T \times n}$ are interpolated following the polynomial interpolation of Eq. (8).

In the next section, we investigate whether the path distribution support of tensor sampling approximates the support of all possible paths in the control space, with linear interpolation. B-spline and Akima-spline variants are deferred to future work.

3.2 Path Coverage Guarantee

We analyze the path coverage property of $G(M, N)$ for sampling control paths with linear interpolation.
In particular, we investigate whether any feasible path in the control space can be approximated arbitrarily well by a path in the random multipartite graph $G(M, N)$ as the number of layers $M$ and the number of waypoints per layer $N$ approach infinity.

Theorem 1 (Asymptotic Path Coverage). Let $u \in \mathcal{F}$ be any control path and $G(M, N)$ be a random multipartite graph with $M$ layers and $N$ uniform samples per layer (cf. Definition 1). Assuming a time sequence (i.e., knots) $0 = t_1 < t_2 < \ldots < t_M = 1$ with equal intervals, associated with layers $L_1, \ldots, L_M \in G(M, N)$ respectively, then

$$\lim_{M, N \to \infty} \; \min_{g \in G(M, N)} \lVert u - g \rVert = 0.$$

Intuitively, Theorem 1 states that the support of the path distribution of $G(M, N)$ approximates $\mathcal{F}$ and converges to $\mathcal{F}$ as $M, N \to \infty$. Thus, sampling paths from $G(M, N)$ provides a tensorized mechanism to efficiently sample any possible path from $\mathcal{F}$, which allows vectorized sampling.

Remark. As $M, N \to \infty$, for any control path $g \in \mathcal{F}$, we have $g \in G(M, N)$. Hence, $G(M, N)$ represents all homotopy classes in the limit $M, N \to \infty$, and sampling paths from $G(M, N)$ approximates sampling from all possible paths. To quantify the exploration level of tensor sampling, one standard way is to investigate its path distribution entropy. Intuitively, when $M, N \to \infty$, the tensor sampling entropy also approaches infinity due to uniform sampling per layer, which is further discussed in Appendix A.2. Note that this theoretical investigation serves as a guiding principle, while practical settings trade off entropy with success rate and runtime feasibility.

Algorithm 1: Sampling Paths From G(M, N)
Input: Control waypoints $Z \in \mathbb{R}^{M \times N \times n}$, number of paths $B$
Output: Sampled control-waypoints $C \in \mathbb{R}^{B \times M \times n}$
1. $I \leftarrow \mathrm{randint}((B, M), 1, N)$  // batch sample with replacement from $\{1, \ldots, N\}$ with shape $(B, M)$
2. $C \leftarrow \mathrm{parse\_index}(Z, I)$  // extract waypoints at the sampled indices into $C \in \mathbb{R}^{B \times M \times n}$

Practical Settings. We typically only set $M < T$.
In principle, increasing $N$ should enable finer-grained exploration over the trajectory space. However, we observe diminishing returns when $N$ increases while keeping the total number of sampled trajectories $B$ fixed (cf. Fig. 8). Intuitively, the underlying graph becomes denser while the number of explored paths remains constant, resulting in only marginal performance improvements. Therefore, we recommend choosing $N$ proportionally to $B$, within the bounds of available GPU memory, to maintain computational efficiency without undersampling an overly dense graph.

3.3 Algorithm

Here, we present the overall algorithm combining tensor and local sampling with smooth structure options in Algorithm 2. We propose a simple mixing mechanism with $\beta \in [0, 1]$, concatenating explorative and exploitative samples into a control trajectory tensor $U$ (Lines 4-8). We include the current nominal control for system stability at fixed-point states (e.g., for tracking tasks) (Howell et al., 2022). Using simulators that allow for parallel runs (Todorov et al., 2012; Makoviychuk et al., 2021), rollout and cost evaluation can be efficiently vectorized (Line 9). To tame the noise induced by tensor sampling, we modify the CEM update with softmax weighting on the elite set for computing the new weighted control means and covariances, similar to the MPPI update rule (Williams et al., 2017). We observe that this greatly smoothens the update over timesteps (Lines 11-13) (cf. Fig. 9). Finally, similar to Howell et al. (2022), we send the first control of the best candidate, since this control trajectory is evaluated in the simulator, rather than the updated mean $\mu$. Notice that we have fixed tensor shapes, based on hyperparameters, for all subroutines of sampling, rolling out, and control distribution updates. Algorithm 2 can be JAX jit-compiled and vmap-ped over a number of $R$ model perturbations $\{f_j(x, u)\}_{j=1}^R$ for efficient online domain randomization, while maintaining real-time control (cf.
Appendix A.5).

Algorithm 2: Model Tensor Planning
Input: Model $f(x, u)$; graph params $M, N$; num samples $B$; mixing rate $\beta \in [0, 1]$; planning horizon $T$; CEM params $\alpha \in [0, 1)$, $\lambda > 0$, $\sigma_m > 0$, $E$, which are the smoothing and temperature scalars, minimum variance, and elite number, respectively.
1. Choose interpolation type: Linear, B-spline (Definition 2), or Akima (Definition 3).
2. Init the nominal control $\mu \in \mathbb{R}^{T \times n}$ and variance $\mathrm{diag}(\sigma^2)$, $\sigma \in \mathbb{R}^{T \times n}$.
3. while task is not complete do
   // tensor sampling
4. Uniformly sample $Q \in \mathbb{R}^{M \times N \times n}$ on the control space $\mathcal{U}$.
5. Sample control waypoints $C \in \mathbb{R}^{P \times M \times n}$ with $P = \lceil \beta B \rceil$, using Algorithm 1.
6. Interpolate $C$ with the chosen interpolation method into control trajectories $U_G \in \mathbb{R}^{P \times T \times n}$.
   // local sampling
7. Sample $B - P - 1$ local trajectories $U_{\mathrm{Local}} \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$.
8. Stack $U_{\mathrm{Local}}$, $U_G$, $\mu$ into $U \in \mathbb{R}^{B \times T \times n}$.
   // update routine using vectorized simulator
9. Batch rollout $X \in \mathbb{R}^{B \times T \times d}$ from model $f(x, u)$, and evaluate the cost matrix $S(X, U) \in \mathbb{R}^{B \times T}$.
10. Sum costs $s = \sum_t S_{:,t} \in \mathbb{R}^B$ and sort the top-$E$ elite candidate indices $i_E$.
11. Select candidates $U' \leftarrow \mathrm{parse\_index}(U, i_E)$ and compute candidate weights $w = \exp(-\frac{1}{\lambda} s[i_E]) / \sum \exp(-\frac{1}{\lambda} s[i_E])$.
12. Compute the new mean $\mu' = \sum_k w_k U'_k$ and variance $\sigma' = \max(\sum_k w_k (U'_k - \mu')^2, \sigma_m)$.
13. Update $\mu \leftarrow \mu + \alpha(\mu' - \mu)$ and $\sigma \leftarrow \sigma + \alpha(\sigma' - \sigma)$.
   // send the best evaluated control $u^* \in \mathbb{R}^n$

4 Experiments

In this section, we investigate our proposed approach with the following research questions/points:

Figure 3: Motivating comparison of MTP methods versus baselines with $B = 256$ on the Navigation environment. The environment is designed so that reaching the green goal is challenging, requiring strong exploration to avoid large local minima in the middle (see task details in Appendix A.3). We plan with $T = 20$ and $\Delta t = 0.05$ s. The figures show 5 random traces of white rollouts.
(Left) MTP-Akima rollouts reach the green goal very early due to high-entropy tensor sampling, while (Right) OpenAI-ES struggles to generate a rollout exploring the way out of the large local minimum.

- How does MTP's performance with Akima/B-spline control variants compare to standard sampling-based MPC baselines and strongly exploratory evolution strategy baselines?
- How does MTP's performance vary with the number of elites and the mixing rate $\beta$ across interpolation methods (Linear, B-spline, Akima-spline)?
- How does MTP's performance vary with (i) the mixing rate $\beta$, associated with MTP's exploration level, on complex environments, and (ii) the MTP-Bspline degree versus planning performance?

We study the cumulative cost $J(\tau, U)$ (Eq. (2)) over the control timesteps for each task. For each experiment, we take the minimum cumulative cost over the batch rollouts at each timestep, and plot the mean and standard deviation over 5 seeds. Further ablations sweeping the graph parameters $M, N$, the softmax weighting, and a JAX planning performance benchmark are presented in Appendix A.5.

Practical Settings. All algorithms and environments are implemented in MuJoCo XLA (Todorov et al., 2012; Kurtz, 2024) to utilize the jit and vmap JAX operators for efficient vectorized sampling and rollout on multiple model instances. All experiment runs are sim-to-sim evaluated (MuJoCo XLA to MuJoCo). In particular, we introduce some modeling errors in MuJoCo XLA. Then, for online domain randomization, we randomize a set of $R$ models $\{f_j(x, u)\}_{j=1}^R$ and perform batch sampling and rollout of $R \times B$ trajectories. Finally, the cost evaluation is averaged over the $R$ domain randomization dimension. Note that, for this paper, we deliberately design the task costs to be simple and set sufficiently short planning horizons to benchmark the exploratory capacity of the algorithms. In practice, one may design dense guiding costs to achieve the tasks. Further task details are presented in Appendix A.3.

Baselines.
For explorative baselines, we choose OpenAI-ES (with antithetic sampling) (Salimans et al., 2017), which shows parallelization capability with a high number of workers in high-dimensional model-based RL settings. Additionally, we choose the recently proposed Diffusion Evolution (DE) (Zhang et al., 2024), bridging the evolutionary mechanism with a diffusion process (Ho et al., 2020), which demonstrates superior performance over classical baselines such as CMA-ES (Hansen, 2016). Both are implemented in evosax (Lange, 2023). For methods that exploit local information, we compare MTP with standard MPPI (Williams et al., 2017) and Predictive Sampling (PS) (Howell et al., 2022) as a sanity check on task completion in sim-to-sim scenarios.

4.1 Motivating Example

Here, we provide an experimental analysis of the baselines' exploration capacity on the Navigation environment (cf. Fig. 3), where the point-mass agent is controlled by an axis-aligned 2-dim velocity controller. We compare MTP-Bspline and MTP-Akima to evolutionary algorithms and standard MPC baselines, with maximum sampling noise settings (cf. Fig. 3). In particular, given the control limits $[-1, 1]$ on the x-y axes, we set the standard deviation $\sigma = 1$ for the sampling noise of MPPI and PS, and for the population generation noise of OpenAI-ES and DE. Fig. 3 also shows the cost convergence and the cost entropy curves over timesteps, in which the entropy is computed as $H = -\sum_{j=1}^{B} P_j \log P_j$, $P_j = \exp(-J_j)/\sum_l \exp(-J_l)$, where $\{J_j\}_{j=1}^{B}$ is the batch of cumulative rollout costs.
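The entropy $H$ above can be computed directly from a batch of cumulative rollout costs. A small NumPy sketch (assuming, as is conventional for costs, that the softmax is taken over negative costs so lower-cost rollouts receive more mass; `rollout_entropy` is an illustrative name):

```python
import numpy as np

def rollout_entropy(J):
    """Softmax entropy of a batch of cumulative rollout costs J, shape (B,).
    H is maximal (log B) when all rollouts cost the same, and near zero when
    a single rollout dominates the softmax mass."""
    P = np.exp(-(J - J.min()))          # shift by the minimum for numerical stability
    P /= P.sum()                        # P_j = exp(-J_j) / sum_l exp(-J_l)
    return -np.sum(P * np.log(P))

B = 256
uniform_costs = np.ones(B)                                      # identical rollouts
peaked_costs = np.concatenate([[0.0], 50.0 * np.ones(B - 1)])   # one dominant rollout
```

A flat entropy curve near $\log B$ thus indicates that the sampler keeps producing diverse, comparably scored rollouts, which is the exploration signal plotted in Fig. 3.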
The entropy represents the diversity of rollout evaluation, implying the exploration capability of the algorithms.

Figure 4: Performance comparison of MTP variants against standard MPC methods (MPPI, PS) and evolutionary algorithms (OpenAI-ES, DE) on Crane, G1-Standup, G1-Walk, Cube-In-Hand, Walker, and PushT. The (horizontal) gray dashed line depicts task success, while the (vertical) red line marks the timestep at which the first algorithm statistically succeeds at the task, or fails last in G1-Walk.

We observe that MTP-Akima has the highest entropy curve, inducing the lowest cost convergence over timesteps, while the other baselines converge to the middle minimum. In this case, MTP-Bspline also struggles due to its conservative interpolation in tensor sampling, achieving moderate entropy, yet higher than the evolutionary strategies.

4.2 Comparison Experiments

We analyze the performance comparison of MTP and the baselines over various robotics tasks representing different dynamics and planning cost settings (cf. Fig. 4). In each task, we tune the baselines and set the same white noise standard deviation for MTP/MPPI/PS to study the performance gain from tensor sampling, while using the default hyperparameters for the evolutionary baselines.

Comparison Environments. PushT (Chi et al., 2023) and Cube-In-Hand (Andrychowicz et al., 2020) are intricate manipulation environments where robust exploration is critical due to complex contact dynamics and precise multi-step manipulation requirements. G1-Standup and G1-Walk represent high-dimensional robotic tasks demanding substantial computational resources and sophisticated control strategies. Crane and Walker (Towers et al., 2024) present underactuated and nonlinear dynamic challenges.
To ensure a fair comparison, for all baselines, we fix the same number of rollouts: $B = 16$ on Crane, and $B = 128$ for all other tasks. All tasks are implemented in hydrax (Kurtz, 2024). Further experiment details are in Appendix A.4.

Figure 5: Mixing rate $\beta$ and number of elites $E$ sweep on the PushT and G1-Walk tasks with $B = 128$, to investigate the algorithmic update rule of Algorithm 2, Lines 9-12. The heatmap indicates accumulated cost over timesteps at termination, and the heat value range is fixed for each task/row.

The PushT task, which involves pushing a T-shaped object precisely to a target location, particularly highlights the advantage of MTP variants. While MPPI and PS frequently encounter mode collapse due to insufficient exploration, resulting in suboptimal or even failed attempts at solving the task, MTP-Bspline and MTP-Akima consistently achieve low-cost convergence. This underscores the significant benefit of the strong exploration enabled by tensor sampling. Evolutionary algorithms like OpenAI-ES and DE perform similarly to MTP in PushT and inherently show better exploration than MPPI and PS, but still fall short of the MTP variants in Cube-In-Hand due to noisy rollouts. Cube-In-Hand requires strong exploration while maintaining intricate control to avoid dropping the cube, thus emphasizing the effectiveness of the MTP β-mixing strategy. In the Crane environment, we apply heavy modeling errors to the mass, inertia, and pulley/joint damping. MTP variants excel by maintaining stable control trajectories with smooth transitions, effectively navigating the nonlinear dynamics and underactuation. The B-spline interpolation's conservative nature helps avoid
overshooting and instability prevalent in these tasks, thus outperforming both evolutionary algorithms and standard MPC methods, which tend to produce erratic control inputs.

The Walker task exhibits rather simple dynamics and is less sensitive to the sampling distribution due to its relatively simple contact model. We apply no modeling error as a sanity check. Indeed, MTP and classical MPPI/PS perform similarly in such simple cases. In G1-Standup, MTP-Akima demonstrates effective humanoid standup due to its aggressive yet smooth trajectory interpolation, enabling efficient exploration and rapid convergence. In contrast, OpenAI-ES and DE struggle with the dimensionality, often yielding higher cumulative costs and failing to adequately sample feasible trajectories, resulting in significant performance gaps. MTP variants have marginally higher performance than the standard MPC baselines in G1-Standup, but show better control stability over the longer G1-Walk before falling. These results underline the capability of MTP to systematically balance exploration and exploitation in high-dimensional tasks.

4.3 Design Ablation

We conduct an ablation study analyzing the effect of varying two crucial hyperparameters, the number of elites E and the mixing rate β, on MTP algorithmic performance. Fig. 5 presents heatmaps indicating accumulated cost over time for each task and interpolation method (MTP-Linear, MTP-Bspline, MTP-Akima). Each heatmap illustrates distinct algorithmic realizations at its corners. Specifically, the bottom-left corner represents PS, characterized by a single elite and purely white noise sampling. Conversely, the bottom-right corner corresponds to Predictive Tensor Sampling (TS-PS), maintaining a single elite but employing full tensor sampling.
Due to the softmax update (Algorithm 2, Lines 11-12), the top-left corner realizes MPPI (with adaptive sampling covariance), leveraging all candidate samples with local noise sampling, while the top-right corner reflects Tensor Sampling-MPPI (TS-MPPI), utilizing all candidates and full tensor sampling for maximum exploration.

We observe the consistent pattern that MTP performance degrades at the extremes of β. In PushT, moderate values of β lead to significantly lower costs, while the absence of tensor sampling (β = 0) yields poor performance due to inadequate exploration. This observation reinforces the effectiveness of the β-mixing strategy for balancing global and local sampling contributions.

Figure 6: Mixing scalar β sweep on PushT, Cube-In-Hand, and G1-Standup with B = 128 to investigate the sensitivity of MTP to the exploration level. The dashed line represents the success bar. In Cube-In-Hand, some cost curves increase due to the cube falling out of the LEAP hand.

Interestingly, the number of elites E has limited impact in PushT, likely due to the task's insensitivity to control stability. However, in G1-Walk, the choice of E is crucial. Using a single elite (E = 1), corresponding to the PS control scheme, leads to unstable and jerky behavior, which aligns with the poor performance observed in Fig. 4. At the other extreme, with β = 1 (full tensor sampling), performance also degrades. This is attributed to the fixed rollout budget B; as M increases, global samples become too sparse to effectively capture the fine-grained control required for stable gait tracking.
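The corner realizations above differ only in how the B candidates are drawn and weighted. A minimal numpy sketch of the β-mixing draw and softmax update, under simplifying assumptions: independent uniform knots stand in for the spline-interpolated tensor samples, and `beta_mixed_candidates`, `softmax_update`, and all shapes are illustrative, not the paper's Algorithm 2.

```python
import numpy as np

def beta_mixed_candidates(nominal, beta, B, sigma, u_lo, u_hi, rng):
    """Draw B control candidates of shape (T, n): a beta fraction are
    global exploratory samples (uniform values per knot), the rest are
    local white-noise perturbations of the nominal trajectory."""
    T, n = nominal.shape
    n_global = int(round(beta * B))
    # Global, exploratory samples: independent uniform knots over U.
    global_u = rng.uniform(u_lo, u_hi, size=(n_global, T, n))
    # Local, exploitative samples: Gaussian perturbation of the nominal.
    local_u = nominal + sigma * rng.standard_normal((B - n_global, T, n))
    return np.clip(np.concatenate([global_u, local_u]), u_lo, u_hi)

def softmax_update(candidates, costs, temperature=1.0):
    """MPPI-style softmax weighting over all candidates (top-left
    corner behavior at beta = 0; TS-MPPI at beta = 1)."""
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return np.einsum("b,btn->tn", w, candidates)

rng = np.random.default_rng(0)
nominal = np.zeros((10, 2))                 # horizon T=10, n=2 controls
U = beta_mixed_candidates(nominal, beta=0.25, B=128, sigma=0.1,
                          u_lo=-1.0, u_hi=1.0, rng=rng)
costs = np.sum(U**2, axis=(1, 2))           # toy quadratic rollout cost
new_nominal = softmax_update(U, costs)      # shape (10, 2)
```

Setting `beta=0` recovers purely local sampling, while `beta=1` uses only the global samples, mirroring the extremes studied in the ablation.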
In this case, exploitation with local samples is essential to maintain intricate motion tracking control.

4.4 Sensitivity Ablation

Here, we perform sensitivity analyses for several critical algorithmic hyperparameters. Fig. 6 evaluates sensitivity to the mixing rate β across tasks. For the PushT environment, results show minimal sensitivity across mixing rates, as the task inherently lacks significant failure modes, ensuring consistent success regardless of the exploration-exploitation balance. In contrast, the Cube-In-Hand task demonstrates high sensitivity, with larger mixing rates causing instability due to the cube frequently falling out of grasp. Optimal performance is thus achieved with lower β values, suggesting careful management of exploration intensity. Furthermore, for the high-dimensional G1-Standup task, a smaller mixing rate helps stabilize the control, enabling the robot to achieve more consistent and stable stand-up performance.

5 Related Works

We review related efforts across two major directions: the vectorization of sampling-based MPC and of sampling-based motion planning. While these approaches originate from different planning paradigms, dynamics-aware MPC versus collision-free geometric planning, they share a common structure: interacting with an agent-environment model (e.g., a dynamics model or collision checker) that can be vectorized for efficient, batched computation. Both domains benefit from high-throughput sampling, making them increasingly amenable to modern GPUs/TPUs.

Sampling-based MPC Vectorization. Sampling-based MPC (Mayne, 2014) has been successfully applied to high-dimensional, contact-rich control problems.
Methods such as Predictive Sampling (PS) (Howell et al., 2022), Model Predictive Path Integral (MPPI) (Williams et al., 2017; Watson & Peters, 2023), and CEM-based MPC (Pinneri et al., 2021) rely on parallel sampling of control trajectories and subsequent rollouts using a system dynamics model. These methods naturally benefit from vectorized simulation backends, and recent works have extended them toward more structured and efficient exploration. For instance, inspired by the diffusion process, DIAL-MPC (Xue et al., 2024) enhances exploration coverage and local refinement simultaneously, achieving high-precision quadruped locomotion and outperforming reinforcement learning policies (Schulman et al., 2017) in climbing tasks. STORM (Bhardwaj et al., 2022) demonstrates GPU-accelerated joint-space MPC for robotic manipulators, achieving real-time performance while handling task-space and joint-space constraints. Other recent efforts integrate GPU-parallelizable simulators, such as Isaac Gym (Makoviychuk et al., 2021), into the MPC loop (Pezzato et al., 2025) for online domain randomization, removing the need for explicit modeling and enabling real-time contact-rich control. In another line, CoVO-MPC (Yi et al., 2024) enhances convergence speed by optimizing the covariance matrix during sampling, leading to performance gains in both simulated and real-world quadrotor tasks. These advances demonstrate that structured, parallel control sampling can be effectively deployed in high-stakes robotics applications using vectorized dynamics models.

Motion Planning Vectorization. Recent advances in sampling-based motion planning have demonstrated that classical methods, such as RRT (Kuffner & LaValle, 2000), can be significantly accelerated through parallel computation while preserving theoretical guarantees like probabilistic completeness, which represents another form of maximum exploration.
Early work focused on accelerating specific subroutines such as collision checking (Bialkowski et al., 2011; Pan & Manocha, 2012), but more recent efforts have restructured planners for full GPU-native execution. Examples include GMT* (Lawson et al., 2020), VAMP (Thomason et al., 2024), pRRTC (Huang et al., 2025), and Kino-PAX (Perrault et al., 2025), which achieve millisecond-scale planning in high-dimensional configuration spaces by parallelizing sampling, forward kinematics, and tree expansions. GTMP (Le et al., 2025) pushes this further by implementing the sampling, graph-building, and search pipeline as tensor operations over batch-planning instances, showcasing the feasibility of real-time planning across multiple environments. Complementing sampling-based planning, trajectory optimization methods such as batch CHOMP (Zucker et al., 2013), Stochastic-GPMP (Urain et al., 2022), cuRobo (Sundaralingam et al., 2023), and MPOT (Le et al., 2023) have embraced vectorization to solve hundreds of trajectory refinement problems in parallel. Many of these systems are further enhanced by high-entropy initialization with learned priors (Carvalho et al., 2023; Huang et al., 2024; Nguyen et al., 2025), allowing them to overcome challenging nonconvexities in cluttered environments. These developments collectively demonstrate that both motion planning and MPC can be reformulated as batched, tensor-based pipelines suitable for modern accelerators. Our work draws on these insights to propose a unified sampling-based control framework that operates entirely through tensorized computation, blending global exploration and local refinement in a single batched planning loop.

6 Discussions and Conclusions

In this work, we introduced Model Tensor Planning (MTP), a robust sampling-based MPC approach designed to achieve global exploration via maximum entropy sampling.
Theoretically, we demonstrated that in the limit of infinite layers M and samples per layer N, our tensor sampling method attains maximum entropy, thereby efficiently approximating the full trajectory space. Furthermore, MTP is intentionally designed to be practically feasible, enabling straightforward implementation for sampling high-entropy control trajectories (see Appendix A.5).

While evolutionary strategy algorithms offer improved exploration capabilities compared to traditional MPC methods (Salimans et al., 2017; Zhang et al., 2024), our experiments highlight their limitations. Their inherently noisy mutation processes often fail to achieve consistent high-entropy exploration, limiting their effectiveness in robotics tasks. In contrast, MTP's tensor sampling consistently explores smooth control possibilities and achieves robust performance.

Spline-based interpolations, notably the B-spline and Akima spline, are central to the practical implementation of MTP. These interpolation methods effectively address discontinuities in simple linear interpolation, ensuring the generation of smooth, continuous, and dynamically feasible control trajectories (Alvarez-Padilla et al., 2024). The experiments underscore the splines' critical role in enhancing trajectory quality, optimizing performance across diverse, complex tasks.

We proposed a simple β-mixing strategy for exploration while retaining intricate controls, effectively balancing exploration and exploitation within sampling-based MPC. This flexible strategy allows easy tuning of the algorithm to various tasks, significantly improving performance stability and robustness across environments with different exploration needs.
From the vectorization standpoint, the matrix-based definition of MTP is specifically structured to leverage Just-in-time compilation (jit) and vectorized mapping (vmap) provided by JAX (Bradbury et al., 2018) and MuJoCo XLA (Todorov et al., 2012). This design choice dramatically accelerates computations, enabling real-time implementation and seamless integration with online domain randomization via vmap, crucial for robust control. Overall, MTP offers an efficient, scalable solution for various robotic tasks that demand high exploration capacity and precise control optimization.

Limitations

While MTP demonstrates strong performance across diverse control tasks, it inherits several limitations typical of sampling-based methods. First, its computational cost scales with the number of rollouts, making it challenging to deploy on hardware with limited parallel computing. Second, while our tensor-based sampler improves exploration coverage, it does not leverage task-specific priors or learning-based proposal distributions, which could further improve sample efficiency. Finally, MTP relies on a fixed dynamics model, limiting its robustness in partially observed or stochastic environments.

Broader Impact Statement

This work contributes to the development of efficient sampling-based control by introducing a scalable, high-entropy sampling mechanism for model predictive control (MPC). Model Tensor Planning (MTP) opens a promising direction in the design of exploratory algorithms that go beyond local refinements, allowing for global reasoning over control spaces. By enabling maximum entropy exploration via structured tensor operations, MTP provides a framework that may benefit a wide range of decision-making systems requiring robust performance in underexplored, high-dimensional environments, such as dexterous manipulation, legged locomotion, or autonomous vehicles operating under partial observability and uncertainty.
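The jit/vmap pattern described in the vectorization paragraph above can be sketched with a toy linear dynamics model standing in for MuJoCo XLA; `step`, `rollout`, and all shapes here are illustrative assumptions, not the paper's implementation.

```python
import jax
import jax.numpy as jnp

def step(x, u):
    # Toy linear dynamics standing in for a MuJoCo XLA simulation step.
    return x + 0.1 * u

def rollout(x0, controls):
    # Accumulate a quadratic state cost along one control trajectory.
    def body(x, u):
        x_next = step(x, u)
        return x_next, jnp.sum(x_next**2)
    _, costs = jax.lax.scan(body, x0, controls)
    return jnp.sum(costs)

# vmap over the batch of B candidate trajectories, jit the whole rollout.
batched_rollout = jax.jit(jax.vmap(rollout, in_axes=(None, 0)))

x0 = jnp.zeros(3)
controls = jnp.ones((128, 20, 3))      # B=128 candidates, horizon T=20
costs = batched_rollout(x0, controls)  # one cost per candidate, shape (128,)
```

Domain randomization follows the same pattern: an extra vmap axis over randomized model parameters batches rollouts across model instances.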
From a real-world deployment perspective, MTP maintains controls within physical limits, but its high-rate, tensorized control sequences may induce rapid variations that practical motor systems must robustly execute. While this fast-changing control is beneficial for agility and responsiveness, it necessitates attention to actuator dynamics and hardware safety. Therefore, safety-aware control filtering or actuator-aware smoothness constraints may be incorporated as extensions for deployment. Furthermore, like other MPC approaches, MTP assumes access to reliable state estimation for initializing planning rollouts in the control loop. In practical deployment, this typically requires a real-time state estimator and a simulation back-end that serves as a digital twin. For example, MuJoCo XLA can simulate hundreds of dynamics instances in parallel, making it suitable for real-time predictive control. However, realizing this in hardware introduces engineering challenges, such as ensuring low-latency communication between the physical robot and the simulator. We see this digital twin architecture as a promising frontier where algorithmic advances like MTP can be tightly integrated with system-level design for robust, real-time, and scalable autonomous control.

Acknowledgments

This work was funded by the German Federal Ministry of Education and Research Software Campus project ROBOSTRUCT (01S23067).

References

Hiroshi Akima. A method of bivariate interpolation and smooth surface fitting based on local procedures. Communications of the ACM, 17(1):18–20, 1974.

Juan Alvarez-Padilla, John Z Zhang, Sofia Kwok, John M Dolan, and Zachary Manchester. Real-time whole-body control of legged robots with model-predictive path integral control. arXiv preprint arXiv:2409.10469, 2024.
OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.

Mohak Bhardwaj, Balakumar Sundaralingam, Arsalan Mousavian, Nathan D Ratliff, Dieter Fox, Fabio Ramos, and Byron Boots. STORM: An integrated framework for fast joint-space model-predictive control for reactive manipulation. In Conference on Robot Learning, pp. 750–759. PMLR, 2022.

Joshua Bialkowski, Sertac Karaman, and Emilio Frazzoli. Massively parallelizing the RRT and the RRT*. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3513–3518. IEEE, 2011.

James Bradbury et al. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax.

Joao Carvalho, An T Le, Mark Baierl, Dorothea Koert, and Jan Peters. Motion planning diffusion: Learning and planning of robot motions with diffusion models. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1916–1923. IEEE, 2023.

João Carvalho, An T Le, Piotr Kicki, Dorothea Koert, and Jan Peters. Motion planning diffusion: Learning and adapting robot motion planning with diffusion models. IEEE Transactions on Robotics, 2025.

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp. 02783649241273668, 2023.

Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134:19–67, 2005.

Carl de Boor. Package for calculating with B-splines. SIAM Journal on Numerical Analysis, 14:57, October 1973. doi: 10.1137/0714026.

Nikolaus Hansen. The CMA evolution strategy: A tutorial.
arXiv preprint arXiv:1604.00772, 2016.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Taylor Howell, Nimrod Gileadi, Saran Tunyasuvunakool, Kevin Zakka, Tom Erez, and Yuval Tassa. Predictive sampling: Real-time behaviour synthesis with MuJoCo. arXiv preprint arXiv:2212.00541, 2022.

Chih H Huang, Pranav Jadhav, Brian Plancher, and Zachary Kingston. pRRTC: GPU-parallel RRT-Connect for fast, consistent, and low-cost motion planning. arXiv preprint arXiv:2503.06757, 2025.

Huang Huang, Balakumar Sundaralingam, Arsalan Mousavian, Adithyavairavan Murali, Ken Goldberg, and Dieter Fox. DiffusionSeeder: Seeding motion optimization with diffusion for rapid motion planning. arXiv preprint arXiv:2410.16727, 2024.

Julius Jankowski, Lara Brudermüller, Nick Hawes, and Sylvain Calinon. VP-STO: Via-point-based stochastic trajectory optimization for reactive robot behavior. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 10125–10131. IEEE, 2023.

James J Kuffner and Steven M LaValle. RRT-Connect: An efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 2, pp. 995–1001. IEEE, 2000.

Vince Kurtz. Hydrax: Sampling-based model predictive control on GPU with JAX and MuJoCo MJX, 2024. https://github.com/vincekurtz/hydrax.

Robert Tjarko Lange. evosax: JAX-based evolution strategies. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pp. 659–662, 2023.

R Connor Lawson, Linda Wills, and Panagiotis Tsiotras. GPU parallelization of policy iteration RRT. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4369–4374. IEEE, 2020.
An T Le, Georgia Chalvatzaki, Armin Biess, and Jan R Peters. Accelerating motion planning via optimal transport. Advances in Neural Information Processing Systems, 36:78453–78482, 2023.

An T Le, Kay Hansel, João Carvalho, Joe Watson, Julen Urain, Armin Biess, Georgia Chalvatzaki, and Jan Peters. Global tensor motion planning. IEEE Robotics and Automation Letters, 2025.

Albert H Li, Preston Culbertson, Vince Kurtz, and Aaron D Ames. DROP: Dexterous reorientation via online planning. arXiv preprint arXiv:2409.14562, 2024.

Matthias Lorenzen, Fabrizio Dabbene, Roberto Tempo, and Frank Allgöwer. Stochastic MPC with offline uncertainty sampling. Automatica, 81:176–183, 2017.

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.

David Q Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967–2986, 2014.

Khang Nguyen, An T Le, Tien Pham, Manfred Huber, Jan Peters, and Minh Nhat Vu. FlowMP: Learning motion fields for robot planning with conditional flow matching. arXiv preprint arXiv:2503.06135, 2025.

Jia Pan and Dinesh Manocha. GPU-based parallel collision detection for fast motion planning. The International Journal of Robotics Research, 31(2):187–200, 2012.

Nicolas Perrault, Qi Heng Ho, and Morteza Lahijanian. Kino-PAX: Highly parallel kinodynamic sampling-based planner. IEEE Robotics and Automation Letters, 2025.

Corrado Pezzato, Chadi Salmi, Elia Trevisan, Max Spahn, Javier Alonso-Mora, and Carlos Hernández Corbato. Sampling-based model predictive control leveraging parallelizable physics simulations. IEEE Robotics and Automation Letters, 2025.

Cristina Pinneri, Shambhuraj Sawant, Sebastian Blaes, Jan Achterhold, Joerg Stueckler, Michal Rolinek, and Georg Martius.
Sample-efficient cross-entropy method for real-time planning. In Conference on Robot Learning, pp. 1049–1065. PMLR, 2021.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347.

Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, et al. cuRobo: Parallelized collision-free robot motion generation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 8112–8119. IEEE, 2023.

Wil Thomason, Zachary Kingston, and Lydia E Kavraki. Motions in microseconds via vectorized sampling-based planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 8749–8756. IEEE, 2024.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.

Julen Urain, An T Le, Alexander Lambert, Georgia Chalvatzaki, Byron Boots, and Jan Peters. Learning implicit priors for motion optimization. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7672–7679. IEEE, 2022.

Bogdan Vlahov, Jason Gibson, Manan Gandhi, and Evangelos A Theodorou. MPPI-Generic: A CUDA library for stochastic optimization. arXiv preprint arXiv:2409.07563, 2024.
Joe Watson and Jan Peters. Inferring smooth control: Monte Carlo posterior policy iteration with Gaussian processes. In Conference on Robot Learning, pp. 67–79. PMLR, 2023.

Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.

Grady Williams, Andrew Aldrich, and Evangelos A Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.

Haoru Xue, Chaoyi Pan, Zeji Yi, Guannan Qu, and Guanya Shi. Full-order sampling-based MPC for torque-level locomotion control via diffusion-style annealing. arXiv preprint arXiv:2409.15610, 2024.

Zeji Yi, Chaoyi Pan, Guanqi He, Guannan Qu, and Guanya Shi. CoVO-MPC: Theoretical analysis of sampling-based MPC and optimal covariance design. In 6th Annual Learning for Dynamics & Control Conference, pp. 1122–1135. PMLR, 2024.

Yanbo Zhang, Benedikt Hartl, Hananel Hazan, and Michael Levin. Diffusion models are evolutionary algorithms. arXiv preprint arXiv:2410.02543, 2024.

Zichen Zhang, Jun Jin, Martin Jagersand, Jun Luo, and Dale Schuurmans. A simple decentralized cross-entropy method. Advances in Neural Information Processing Systems, 35:36495–36506, 2022.

Matt Zucker, Nathan Ratliff, Anca D Dragan, Mihail Pivtoraiko, Matthew Klingensmith, Christopher M Dellin, J Andrew Bagnell, and Siddhartha S Srinivasa. CHOMP: Covariant Hamiltonian optimization for motion planning. The International Journal of Robotics Research, 32(9-10):1164–1193, 2013.

A.1 Proof of Theorem 1

Let u ∈ F be any path that is uniformly continuous and has bounded variation TV(u) < ∞. We begin by constructing a piecewise linear path g_M : [0, 1] → U approximating u, by dividing the interval [0, 1] into M − 1 subintervals, i.e., [t_1, t_2], . . . , [t_{M−1}, t_M] with 0 = t_1 < t_2 < . . . < t_M = 1.
On each subinterval [t_i, t_{i+1}], we define the corresponding segment of g_M to approximate u:

g_M(t) = u(t_i) + (u(t_{i+1}) − u(t_i)) / (t_{i+1} − t_i) · (t − t_i),  t ∈ [t_i, t_{i+1}].  (11)

Then, we define a control path in G(M, N).

Definition 4 (Path In G(M, N)). A path u : [0, 1] → U is in G(M, N) (i.e., u ∈ G(M, N)) if and only if u is piecewise linear with M − 1 segments and u(t_i) ∈ L_i for all 1 ≤ i ≤ M.

Lemma 1 (Piecewise Linear Path Approximation). Let g_1, g_2 be piecewise linear functions with the same number of partition points {g_1(t_i)}_{i=1}^M, {g_2(t_i)}_{i=1}^M, with 0 = t_1 < t_2 < . . . < t_M = 1. Then ‖g_1 − g_2‖_∞ < ϵ if and only if ‖g_1(t_i) − g_2(t_i)‖ < ϵ for all 1 ≤ i ≤ M.

Proof. Sufficiency. Given ‖g_1(t_i) − g_2(t_i)‖ < ϵ for all 1 ≤ i ≤ M: since g_1, g_2 are piecewise linear functions, linear interpolation between partition points t_i, t_{i+1} ensures that the difference between g_1 and g_2 is maximized at the partition points. Consider g_1, g_2 on a segment [t_i, t_{i+1}]:

‖g_1(t) − g_2(t)‖ ≤ max{‖g_1(t_i) − g_2(t_i)‖, ‖g_1(t_{i+1}) − g_2(t_{i+1})‖} < ϵ.  (12)

Hence, ‖g_1 − g_2‖_∞ = max_{t∈[0,1]} ‖g_1(t) − g_2(t)‖ < ϵ. Necessity. Given ‖g_1 − g_2‖_∞ < ϵ, then ‖g_1(t_i) − g_2(t_i)‖ < ϵ for all 1 ≤ i ≤ M.

We now show that any piecewise linear path g_M with equal subintervals, approximating u ∈ F, converges uniformly to u as M → ∞.

Lemma 2 (Convergence Of Linear Path Approximation). Let g_M be any piecewise linear path approximating u ∈ F with M equal subintervals of width h = 1/M. Then lim_{M→∞} ‖u − g_M‖_∞ = 0.

Proof. Since u is uniformly continuous on [0, 1], for any ϵ > 0 there exists δ > 0 such that for all t, s ∈ [0, 1], if |t − s| < δ, then ‖u(t) − u(s)‖ < ϵ/2. The variation of u within each subinterval thus approaches zero as M → ∞. Hence, for sufficiently large M, each subinterval length h = 1/M < δ, and thus

sup_{t∈[t_i,t_{i+1}]} ‖u(t) − g_M(t)‖ = sup_{t∈[t_i,t_{i+1}]} ‖u(t) − u(t_i) − (u(t_{i+1}) − u(t_i)) (t − t_i)/h‖
  ≤ sup_{t∈[t_i,t_{i+1}]} ‖u(t) − u(t_i)‖ + sup_{t∈[t_i,t_{i+1}]} ‖u(t_{i+1}) − u(t_i)‖ |t − t_i|/h
  < ϵ/2 + ϵ/2 = ϵ.  (14)

Taking the supremum over all t ∈ [0, 1] (i.e., over the M equal subintervals), we obtain

‖u − g_M‖_∞ < ϵ.
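Lemma 2 can be checked numerically: the sup-norm gap between a smooth path and its piecewise-linear interpolant on M equally spaced knots shrinks as M grows. A small numpy sketch (the test path and the `sup_gap` helper are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

def sup_gap(u, M, grid=2000):
    """Sup-norm distance between a scalar path u and its piecewise-linear
    interpolant g_M on M equally spaced knots over [0, 1]."""
    knots = np.linspace(0.0, 1.0, M)
    t = np.linspace(0.0, 1.0, grid)
    g = np.interp(t, knots, u(knots))  # linear interpolation between knots
    return np.max(np.abs(u(t) - g))

u = lambda t: np.sin(2 * np.pi * t)    # a uniformly continuous test path
gaps = [sup_gap(u, M) for M in (5, 20, 80)]
# The gap decreases with M, consistent with ||u - g_M|| -> 0.
```

For this smooth path the error decays on the order of h^2, faster than the ϵ-δ argument of the lemma requires.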
(15)

Since ϵ > 0 is arbitrary and h → 0 as M → ∞, it follows that

lim_{M→∞} ‖u − g_M‖_∞ = 0.  (16)

We now prove that the random multipartite graph discretization is asymptotically dense in F. Specifically, as the number of layers M and the number of samples per layer N approach infinity, the graph will contain a path that uniformly approximates any continuous path in F.

Theorem 1 (Asymptotic Path Coverage). Let u ∈ F be any control path and G(M, N) be a random multipartite graph with M layers and N uniform samples per layer (cf. Definition 1). Assuming a time sequence (i.e., knots) 0 = t_1 < t_2 < . . . < t_M = 1 with equal intervals, associated with layers L_1, . . . , L_M ∈ G(M, N) respectively, then

lim_{M,N→∞} min_{g∈G(M,N)} ‖u − g‖_∞ = 0.

Proof. First, Lemma 2 implies that there exists a sequence of piecewise linear paths g_M, with M − 1 equal intervals approximating u, converging to u as M → ∞. Let ĝ_M ∈ G(M, N) be a control path in G (cf. Definition 4). Since the time sequence 0 = t_1 < t_2 < . . . < t_M = 1 corresponding to layers L_1, . . . , L_M ∈ G(M, N) has equal intervals, we can consider ĝ_M having M − 1 segments approximating g_M without loss of generality. Since U is open, for each i = 1, . . . , M, there exists a ball B_ϵ(u(t_i)) ⊂ U, ϵ > 0. Since ĝ_M and g_M have the same number of segments by definition, the event ‖ĝ_M − g_M‖_∞ < ϵ is the event that, for each layer 1 ≤ i ≤ M, at least one node is sampled inside the ball B_ϵ(g_M(t_i)). The probability that none of the N samples in layer L_i fall inside B_ϵ(g_M(t_i)) is

(1 − µ(B_ϵ(g_M(t_i))))^N ≤ e^{−cN}  (17)

for some c > 0. From Lemma 1, the probability that every layer contains at least one such sample, so that ‖g_M − ĝ_M‖_∞ < ϵ, is at least 1 − M e^{−cN}, which converges to 1 as N → ∞. From Lemma 2, for sufficiently large M, we have ‖u − g_M‖_∞ < ϵ. Now, since U is compact, we can apply the triangle inequality as N → ∞:

‖u − ĝ_M‖_∞ ≤ ‖u − g_M‖_∞ + ‖g_M − ĝ_M‖_∞ < ϵ + ϵ = 2ϵ.
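The coverage argument can be probed empirically: with N uniform node samples per layer, the nearest graph path to a target path (measured at the knots, as Lemma 1 permits) gets closer on average as N grows. A numpy sketch with illustrative scalar controls on U = [−1, 1] (all names and sizes are assumptions for this toy check):

```python
import numpy as np

def nearest_path_gap(target, M, N, rng):
    """Min over paths in G(M, N) of the max per-knot distance to a target
    path; with fully connected layers this decouples layer by layer."""
    knots = np.linspace(0.0, 1.0, M)
    layers = rng.uniform(-1.0, 1.0, size=(M, N))  # N node samples per layer
    # Picking the best node in each layer yields the nearest graph path.
    per_knot = np.min(np.abs(layers - target(knots)[:, None]), axis=1)
    return per_knot.max()

rng = np.random.default_rng(1)
target = lambda t: np.sin(2 * np.pi * t)
gaps = [np.mean([nearest_path_gap(target, 10, N, rng) for _ in range(50)])
        for N in (4, 32, 256)]
# The average gap shrinks toward 0 as N grows, matching the
# 1 - M e^{-cN} coverage bound from the proof.
```

The per-layer decoupling used here mirrors the proof: the event that the nearest path is ϵ-close is exactly the event that every layer has a node in the corresponding ϵ-ball.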
(18)

Since ϵ was arbitrary, we conclude that

lim_{M,N→∞} min_{ĝ_M∈G(M,N)} ‖u − ĝ_M‖_∞ = 0.  (19)

A.2 Exploration Versus Exploitation Discussion

We compare tensor sampling (Definition 1, with linear interpolation) against MPPI sampling with horizon T, corresponding to global exploration versus local exploitation behaviors from the current system state. In particular, we remark on the entropy of path distributions of both methods in the discretized control setting with equal time intervals. Further investigation of the continuous control setting is left for future work.

Let a discrete control path be τ = [u_1, . . . , u_T] ∈ R^{T×n}, u_t ∈ U. Let P(τ) be the probability of sampling control path τ under a given planning method. The entropy is then defined as

H(P) = −Σ_{τ∈F_T} P(τ) log P(τ),  (20)

where F_T ⊂ F is the set of all possible discrete control paths τ of length T. In G(M, N), each node in layer L_i is sampled independently from a uniform distribution over U, and path candidates are equivalent to sequences of node indices τ ↦ (i_1, i_2, . . . , i_M) ∈ {1, . . . , N}^M (cf. Algorithm 1). Let S denote the set of all index sequences representing valid paths through the graph. The uniform distribution over S is given by P_G(τ) = 1/|S| with |S| = N^M. Hence, the entropy of tensor sampling is

H(P_G) = −Σ_{τ∈S} (1/N^M) log(1/N^M) = log(N^M) = M log N.  (21)

Indeed, as M, N → ∞, the entropy H(P_G) → ∞, and the distribution over sampled paths in G becomes maximum entropy over F_T among all discrete path distributions. Theorem 1 implies that F_T → F as M, N → ∞, and thus the tensor sampling distribution becomes maximum entropy over F. Now, a typical MPPI implementation generates control paths by perturbing a nominal trajectory τ̄ with Gaussian noise (Vlahov et al., 2024), u_t = ū_t + ϵ_t, ϵ_t ∼ N(0, Σ), and propagating the dynamics to generate a state trajectory. The path distribution P_MPPI concentrates around τ̄ and is generally non-uniform.
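The count in Eq. (21) follows directly from uniformity over the N^M index sequences and can be verified by brute force on a toy graph (sizes here are illustrative):

```python
import numpy as np

def tensor_sampling_entropy(M, N):
    # Uniform distribution over N**M index sequences: H = M log N.
    return M * np.log(N)

# Brute-force check: enumerate all paths of a toy G(3, 4).
M, N = 3, 4
p = np.full(N**M, 1.0 / N**M)       # uniform over all 4^3 = 64 paths
H_bruteforce = -np.sum(p * np.log(p))
# H_bruteforce equals M * log N = 3 log 4.
```
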
The entropy is constant and can be computed in closed form:

H(P_MPPI) = (Tn/2)(1 + log(2π)) + (T/2) log det(Σ),  (22)

due to independent Gaussian noise over timesteps (i.e., a white noise kernel (Watson & Peters, 2023)). In general, tensor sampling serves as a configurable high-entropy sampling mechanism over the control trajectory space, offering maximum exploration, while MPPI targets local improvement around a nominal trajectory, thereby performing exploitation. This distinction motivates the hybrid method, where we mix explorative (smooth) controls with local controls sampled from a typical white noise kernel.

A.3 Task Details

Here, we provide details on the tasks, their cost definitions, and their domain randomization. Motion capture sensors in MuJoCo are used to implement the tasks. For this paper, we deliberately design the task costs to be simple and set sufficiently short planning horizons to benchmark the exploratory capacity of the algorithms. In practice, one may design dense guiding costs to make tasks easier.

Navigation. A planar point mass moves in a bounded 2D space via velocity commands to reach a target while avoiding collisions. The state cost is defined as

c(x_t, u_t) = α_1 exp(−λ d_wall(x_t)) + α_2 ‖x_t − x_g‖² + α_3 ‖u_t‖²,

where d_wall(x_t) is the distance to the closest wall, x_g is the goal position, and u_t is the velocity control. Success is defined when the agent's distance to the target is sufficiently small. This task is extremely difficult for sampling-based MPC due to large local minima near the starting point.

Crane. The agent controls a luffing crane via torque inputs to move a suspended payload to a target while minimizing oscillations. The cost function penalizes payload deviation and swing:

c(x_t, u_t) = α_1 ‖x_t − x_g‖² + α_2 ‖ẋ_t‖²,

where x_t is the payload tip position, x_g is the target point, and ẋ_t is the tip velocity. Success is achieved when the payload tip is within a small radius of the target location.
This task is difficult due to heavy modeling error and underactuation, which are common in real crane applications.

Cube-In-Hand. Using velocity control of a dexterous LEAP hand, the agent must rotate a cube to match a randomly sampled target orientation. The cost combines position and orientation error: $c(x_t, u_t) = \alpha_1 d_{SE(3)}(x_t, x_g)^2 + \alpha_2 \|\dot{x}_t\|^2$, where $x_t$ and $x_g$ are the current cube and target poses, $d_{SE(3)}$ is the SE(3) distance metric between poses, and $u_t = \dot{x}_t$. Success is defined when the combined position and orientation errors fall below a threshold. This task is difficult due to high-dimensional, contact-rich dynamics and the failure mode of the cube falling.

G1-Walk. A Unitree G1 humanoid robot tracks a motion-captured walking trajectory using position control. The cost is defined as the deviation from reference joint positions: $c(x_t, u_t) = \alpha_1 \|x_t - x_{\text{ref}}(t + k)\|^2$, where $x_t$ and $x_{\text{ref}}(t+k)$ are the current and reference joint positions, given the current control iteration k. Success is not binary but is measured by minimizing deviation from the reference joint configurations. Note that this cost is not designed for stable locomotion. The main challenge is maintaining motion-tracking locomotion over long horizons with complex joint couplings.

G1-Standup. The humanoid must rise from a lying pose to an upright standing posture. The cost penalizes deviation from the upright pose and instability: $c(x_t, u_t) = \alpha_1 (h_t - h^*)^2 + \alpha_2 d_{SO(3)}(R^{\text{torso}}_t, R_g)^2 + \alpha_3 \|q_t - q_{\text{nominal}}\|^2$, where $h_t$ and $h^*$ are the torso height and the standing height threshold, and $R^{\text{torso}}_t$ and $R_g$ are the current and target torso orientations; $h_t$, $R^{\text{torso}}_t$, and $q_t$ are elements of $x_t$. Success is defined when the height of the torso exceeds a target threshold. The task is difficult due to large initial instability and the need to achieve balance in high-dimensional dynamics.

Push T. A position-controlled end effector pushes a T-shaped block to a goal pose.
The cost measures the block pose error, $c(x_t, u_t) = \alpha_1 d_{SE(3)}(x_t, x_g)^2$, where $x_t$ and $x_g$ are the current T-block and target poses. Success is achieved when the block's position and orientation errors are minimized. The task is challenging because the contact dynamics are complex, requiring precise interaction strategies.

Walker. A planar biped must walk forward at a desired velocity while maintaining an upright torso. The cost function penalizes deviation from the target velocity and orientation: $c(x_t, u_t) = \alpha_1 (h_t - h^*)^2 + \alpha_2 (\theta_t - \theta^*)^2 + \alpha_3 (v_t - v^*)^2 + \alpha_4 \|u_t\|^2$, where $h_t$ and $h^*$ are the torso height and the standing height threshold, $v_t$ and $v^*$ are the forward and target velocities, and $\theta_t$ and $\theta^*$ are the torso and target pitch angles; $h_t$, $\theta_t$, and $v_t$ are elements of $x_t$. Success is measured by stable forward motion and velocity tracking. The difficulty lies in generating stable gaits without explicit foot placement planning.

Table 1 summarizes the environments used in our experiments, including their state and action space dimensions, control modalities, and whether domain randomization or task randomization was applied. We set the control horizon and sim steps per plan such that they resemble realistic control settings.

Table 1: Summary of environment properties.

Task          State Dim  Action Dim  Control Type  Domain Randomization
Navigation        4          2       Velocity      Joint obs. noise, actuation gain, init position
Crane            24          3       Torque        Payload mass, inertia, joint damping, actuation gain
Cube-In-Hand     39         16       Velocity      Joint obs. noise, geom friction
G1-Walk         142         29       Position      Joint obs. noise, geom friction
G1-Standup       71         29       Position      Joint obs. noise, geom friction
Push T           14          2       Position      Geom friction, init pose
Walker           18          6       Torque        None (fixed init)

A.4 Experiment Details & Discussions

Comparison Experiment. We summarize the simulation settings for the comparison experiments (cf. Section 4.2) in Table 2, and the hyperparameters used for MTP variants in Table 3.
Table 4 summarizes the hyperparameters used for MPPI and PS (the temperature is not relevant for PS). The same noise σ is used for MTP local samples. For the evolutionary strategies (DE and OpenAI-ES), we use the default hyperparameters in evosax (Lange, 2023).

Design Ablation. We swept β ∈ [0, 1] and the number of elites to experimentally study the algorithmic design. The MTP hyperparameters in Section 4.3 are the same as in Table 3. Each setting was evaluated with 4 random seeds.

Table 2: Simulation Settings for Experiments

Task          Horizon Δt [s]  Horizon  Sim Step/Plan  Sim Hz  Num. Randomizations
Navigation    0.05            20       2              100     8
Crane         0.4             2        16             500     32
Cube-In-Hand  0.04            3        2              100     8
G1-Walk       0.1             4        1              100     4
G1-Standup    0.2             3        1              100     4
Push T        0.1             5        10             1000    4
Walker        0.15            4        15             200     1

Table 3: MTP Hyperparameters

Task          M  N    σmin  Elites  β     α
Navigation    5  30   -     -       1.0   -
Crane         2  30   0.05  8       0.5   0.0
Cube-In-Hand  2  50   0.15  5       0.5   0.1
G1-Standup    2  100  0.2   100     0.05  0.0
G1-Walk       2  100  0.1   100     0.02  0.0
Push T        3  50   0.1   20      0.5   0.0
Walker        2  50   0.3   20      0.5   0.5

Table 4: PS/MPPI Hyperparameters

Task          Noise Std. σ  Temperature
Navigation    1.0           0.1
Crane         0.05          0.1
Cube-In-Hand  0.15          0.1
G1-Standup    0.2           0.1
G1-Walk       0.1           0.01
Push T        0.3           0.1
Walker        0.3           0.1

Sensitivity Ablation. We conducted the β-mixing rate ablation study (cf. Section 4.4) by varying β across the different MTP variants. The MTP hyperparameters in Section 4.4 are the same as in Table 3. Each configuration was evaluated with 4 random seeds.

Experimental Discussions. In tasks with well-shaped or dense reward structures and moderately nonlinear dynamics, exploration becomes less critical, and nominal sampling methods (e.g., MPPI, PS) often suffice, explaining MPPI's strong performance in G1-Standup, which benefits from fully actuated dynamics and informative rewards.
However, in more challenging scenarios involving sparse rewards or highly nonlinear dynamics, such as Push T and under domain shifts in Crane, these locally guided strategies tend to struggle with inadequate exploration, often converging to suboptimal solutions or showing high performance variance. A representative case is the Navigation task (Figure 3), where the agent must discover velocity sequences to bypass obstacles and reach the goal, a setting in which local Gaussian sampling clearly fails by getting trapped in local minima. To address these challenges, MTP introduces a structured high-entropy sampling mechanism along with a simple yet effective β-mixing strategy that balances global exploration and local exploitation. With careful tuning of β, M, and N, MTP demonstrates robust and consistent performance across diverse tasks.

Mixing Rate Tuning. β determines the ratio between exploratory (tensor sampling) and exploitative (nominal sampling) samples and is delicate to tune. As shown in our ablations in Fig. 6 and Fig. 5, high β values (e.g., 0.5) introduce strong exploration, which may benefit tasks with sparse rewards that do not require delicate fixed-point stability (e.g., Push T, Cube-In-Hand, Navigation). Conversely, in tasks requiring high stability or precise actuation, such as G1-Standup or G1-Walk, lower β values (e.g., 0.05-0.1) tend to yield better performance by favoring consistent behavior while still injecting enough exploration to escape suboptimal solutions.

A.5 Additional Ablation & Performance Benchmarks

In this section, we conduct further MTP ablations to understand how hyperparameters affect MTP performance, and we briefly benchmark the baseline JAX implementations to confirm real-time performance.
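As context for the β ablations, the batch construction behind the β-mixing strategy discussed above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_global` is a hypothetical stand-in for the tensor-sampling routine, and the split of the rollout budget B into β·B exploratory and (1−β)·B local samples follows the description in the text.

```python
import jax
import jax.numpy as jnp

def beta_mixed_batch(key, mean_traj, sigma, beta, B, sample_global):
    """Mix exploratory (tensor-sampled) and exploitative (local Gaussian)
    control trajectories. `sample_global` is a hypothetical placeholder
    returning (n, T, dim) exploratory trajectories."""
    k_global, k_local = jax.random.split(key)
    n_global = int(beta * B)          # exploratory share of the budget
    n_local = B - n_global            # local MPPI-style share
    global_samples = sample_global(k_global, n_global)        # (n_global, T, dim)
    noise = sigma * jax.random.normal(k_local, (n_local,) + mean_traj.shape)
    local_samples = mean_traj[None] + noise                   # (n_local, T, dim)
    return jnp.concatenate([global_samples, local_samples], axis=0)
```

With β = 1 (as in the sweep-M,N ablation below in spirit), the batch is fully exploratory; with β = 0 it reduces to white-noise perturbations of the nominal trajectory.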
Figure 7: MTP-Bspline degree ablation (cost and velocity magnitude over time for p = 2 to p = 5). In G1-Walk, the Unitree G1 falls at roughly 500 time steps for all controlled B-spline degrees.

B-spline Degree Ablation. In Fig. 7, we investigate the sensitivity of MTP performance to the B-spline interpolation degree. The MTP hyperparameters are the same as in Table 3, and each setting was evaluated using 4 random seeds. Results consistently show minimal performance differences across degrees (p = 2 to p = 5) in terms of both cost and velocity magnitude curves across different tasks. Given this insensitivity, we select the lower-complexity B-spline with p = 2 as the default choice for the MTP-Bspline method.

Sweep M, N Ablation. We conducted an additional ablation to study the effect of varying the M and N values of the MTP variants on the Navigation task. We set β = 1 to use full tensor sampling. Each configuration was evaluated with 4 random seeds.

Figure 8: Sweep of M, N on the Navigation environment with B = 256 to investigate the interplay between the number of batch samples B, the number of layers M, and the number of control waypoints per layer N. Each data point is the success rate over 4 seeds. The environment setting is as in Appendix A.3.

According to Fig. 8, there exists a sweet spot in selecting the number of layers M in tensor sampling. Roughly, for all MTP variants, increasing M initially improves task performance, as more layers provide sufficient path complexity, allowing the planner to escape local minima and generate diverse, globally exploratory trajectories. However, beyond a certain point, further increasing M degrades performance.
This is due to the exponential growth in the number of possible paths, $O(N^M)$, while the rollout budget B remains fixed. As a result, the sampled trajectory density becomes sparse relative to the vast number of paths, reducing effective coverage and leading to diminished exploration and performance. On the other hand, increasing N consistently improves performance by densifying the search at each layer. However, this comes at a higher computational cost. Therefore, a careful balance must be struck between M and N to maintain real-time control and effective control exploration.

Planning Performance. Table 5 benchmarks the performance of our JAX-based implementation by measuring the wall-clock time of the JIT-compiled planning function on the G1-Standup task. This function includes a single sampling, a single trajectory rollout, a single cost evaluation, and a single parameter update. Note that this benchmark only serves as an exemplary simulated planning-function performance with JAX JIT; in practice, we might have multiple search refinements or multiple parameter updates in the control loops. The task setting is similar to Table 2, but we set the sim steps per plan to 1 with B = 128. The results show that the initial JIT compilation incurs a significant one-time cost, as expected for MuJoCo XLA pipelines. However, after compilation, the per-step planning rates across MTP, MPPI, PS, and the evolutionary baselines are roughly similar for the same batch size B, as also reflected in Table 6, Table 7, and Table 8. The results confirm that the JIT [s] and Planning Time [ms] are algorithm-agnostic, depending slightly on the batch size B (i.e., Planning Time [ms] increases logarithmically with B) and on the environment dynamics. All algorithms remain real-time feasible on GPU-accelerated hardware when implemented with JAX and MuJoCo XLA.
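The compile-once, fast-thereafter pattern reported above can be illustrated with any jitted function. The sketch below uses a toy planning step (batched candidates, quadratic costs, softmax weighting) as a hypothetical stand-in for the full sampling–rollout–update pipeline; the timing pattern, not the dynamics, is the point.

```python
import time
import jax
import jax.numpy as jnp

# Toy stand-in for a planning step; any jitted function shows the same
# compile-once, fast-thereafter behavior measured in Table 5.
@jax.jit
def plan_step(mean, noise):
    samples = mean[None] + noise                   # (B, T, n) control candidates
    costs = jnp.sum(samples ** 2, axis=(1, 2))     # (B,) surrogate rollout costs
    weights = jax.nn.softmax(-costs)               # soft elite weighting
    return jnp.einsum('b,btn->tn', weights, samples)

mean, noise = jnp.zeros((4, 2)), jnp.ones((128, 4, 2))

t0 = time.perf_counter()
plan_step(mean, noise).block_until_ready()         # first call pays JIT compilation
jit_time = time.perf_counter() - t0

t0 = time.perf_counter()
plan_step(mean, noise).block_until_ready()         # subsequent calls hit the cache
step_time = time.perf_counter() - t0
```

`block_until_ready()` forces JAX's asynchronous dispatch to finish before stopping the timer, which is necessary for honest wall-clock measurements.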
Table 5: JAX implementation benchmark on G1-Standup, evaluated with 5 seeds on an Nvidia RTX 3090.

                    MTP-Bspline  MTP-Akima    PS           MPPI         OpenAI-ES    DE
JIT Time [s]        76.4 ± 1.2   74.62 ± 2.5  72.35 ± 4.5  73.87 ± 4.2  69.87 ± 1.2  73.62 ± 3.1
Planning Time [ms]  2.7 ± 0.3    2.7 ± 0.4    3.1 ± 0.7    2.6 ± 0.2    2.9 ± 0.5    3.2 ± 0.6

Table 6: Planning performance of MTP-Akima. Averaged over 5 seeds on an Nvidia RTX 4090.

              JIT Time [s]                          Planning Time [ms]
Batch Size B  Push T  Crane  Cube-In-Hand  G1-Walk  Push T     Crane      Cube-In-Hand  G1-Walk
64            15.8    32.5   38.2          65.3     1.7 ± 0.1  1.8 ± 0.1  8.1 ± 0.3     1.4 ± 0.1
128           12.7    31.5   36.6          58.5     1.6 ± 0.1  1.9 ± 0.1  10.2 ± 0.7    1.5 ± 0.1
256           16.3    31.4   38.8          57.1     2.0 ± 0.2  2.0 ± 0.4  14.7 ± 0.8    1.5 ± 0.1

Table 7: Planning performance of MPPI. Averaged over 5 seeds on an Nvidia RTX 4090.

              JIT Time [s]                          Planning Time [ms]
Batch Size B  Push T  Crane  Cube-In-Hand  G1-Walk  Push T     Crane      Cube-In-Hand  G1-Walk
64            14.9    32.3   37.7          62.9     1.7 ± 0.1  1.7 ± 0.1  8.2 ± 0.3     1.4 ± 0.1
128           18.4    29.9   36.7          58.2     1.7 ± 0.1  1.8 ± 0.1  10.4 ± 0.5    1.4 ± 0.1
256           14.9    30.3   37.5          55.7     1.9 ± 0.1  1.8 ± 0.1  15.2 ± 0.9    1.4 ± 0.1

Table 8: Planning performance of OpenAI-ES. Averaged over 5 seeds on an Nvidia RTX 4090.

              JIT Time [s]                          Planning Time [ms]
Batch Size B  Push T  Crane  Cube-In-Hand  G1-Walk  Push T     Crane      Cube-In-Hand  G1-Walk
64            15.1    32.1   39.2          61.5     1.7 ± 0.1  1.8 ± 0.1  8.2 ± 0.3     1.6 ± 0.1
128           19.0    30.1   38.4          57.9     1.7 ± 0.1  1.9 ± 0.1  10.3 ± 0.4    1.5 ± 0.1
256           16.1    30.9   42.3          55.1     2.0 ± 0.1  1.8 ± 0.1  14.9 ± 0.7    1.5 ± 0.1

This demonstrates that MTP, despite its global exploration capabilities, remains suitable for real-time control applications. Our implementation benefits from efficient JIT and vmap vectorization in JAX and is compatible with MuJoCo's XLA backend. These design choices ensure that the sampling, rollout, and learning components of MTP are fully optimized and scalable, and they support advanced techniques such as online domain randomization.
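The vmap vectorization mentioned above follows a standard pattern: write a single-trajectory rollout and lift it over the batch dimension. The sketch below uses a toy linear `dynamics` as a hypothetical stand-in for the MuJoCo XLA step; only the vectorization pattern, not the physics, is illustrated.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for the simulator step; in our pipeline the MuJoCo XLA
# step plays this role, but the vmap pattern is identical.
def dynamics(state, action):
    return state + 0.1 * action

def rollout(state0, controls):                     # controls: (T, n)
    def step(state, u):
        nxt = dynamics(state, u)
        return nxt, nxt
    _, traj = jax.lax.scan(step, state0, controls)
    return traj                                     # (T, n) visited states

# One vmap call turns a single rollout into B parallel rollouts,
# sharing the initial state across the batch of control candidates.
batched_rollout = jax.jit(jax.vmap(rollout, in_axes=(None, 0)))
states = batched_rollout(jnp.zeros(2), jnp.ones((128, 5, 2)))  # (128, 5, 2)
```

Composing `jax.jit` around the vmapped rollout is what makes the per-step planning time nearly independent of the algorithm, as the tables above show.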
Overall, the benchmark confirms the practicality of deploying MTP in high-performance robotic control loops.

Softmax Update Effect. Fig. 9 illustrates the impact of applying softmax weighting to elite candidates when updating the mean and standard deviation of control trajectories. The left plot demonstrates smoother and lower-variance control updates over time compared to the updates without softmax weighting shown on the right. The smooth and stable updates afforded by softmax weighting are essential for effectively mixing global and local samples, highlighting its critical role in MTP's performance.

Figure 9: Softmax weighting ablation on the Push T environment. Both control update trajectories converge to near-zero means with large variance at 100 timesteps, signifying task completion.
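A softmax-weighted elite update of the kind ablated above can be sketched as follows. This is a minimal illustration under assumed shapes and a hypothetical `temperature` parameter; the exact update rule and temperature in MTP may differ.

```python
import jax
import jax.numpy as jnp

def softmax_update(elite_trajs, elite_costs, temperature=0.1):
    """Softmax-weighted mean/std update over elite control trajectories.

    elite_trajs: (E, T, n) elite control candidates; elite_costs: (E,).
    Lower-cost elites receive exponentially larger weights, yielding
    smoother, lower-variance updates than a hard uniform elite average.
    """
    w = jax.nn.softmax(-elite_costs / temperature)             # (E,) weights
    mean = jnp.einsum('e,etn->tn', w, elite_trajs)             # weighted mean
    var = jnp.einsum('e,etn->tn', w, (elite_trajs - mean[None]) ** 2)
    return mean, jnp.sqrt(var)
```

Compared with a uniform average over elites, the soft weighting discounts marginal elites smoothly, which is one way to obtain the lower-variance update curves shown on the left of Fig. 9.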