# Time-Constrained Robust MDPs

Adil Zouitine 1,2, David Bertoin 1,3,6, Pierre Clavier 4,5, Matthieu Geist 7, Emmanuel Rachelson 2,6
1 IRT Saint-Exupéry, 2 ISAE-SUPAERO, Université de Toulouse, 3 IMT, INSA Toulouse, 4 École Polytechnique, CMAP, 5 Inria Paris, HeKA, 6 ANITI, 7 Cohere
{adil.zouitine, david.bertoin}@irt-saintexupery.com, pierre.clavier@polytechnique.edu
These authors contributed equally to this work.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Robust reinforcement learning is essential for deploying reinforcement learning algorithms in real-world scenarios where environmental uncertainty predominates. Traditional robust reinforcement learning often depends on rectangularity assumptions, where adverse probability measures of outcome states are assumed to be independent across different states and actions. This assumption, rarely fulfilled in practice, leads to overly conservative policies. To address this problem, we introduce a new time-constrained robust MDP (TC-RMDP) formulation that considers multifactorial, correlated, and time-dependent disturbances, thus more accurately reflecting real-world dynamics. This formulation goes beyond the conventional rectangularity paradigm, offering new perspectives and expanding the analytical framework for robust RL. We propose three distinct algorithms, each using a different level of environmental information, and evaluate them extensively on continuous control benchmarks. Our results demonstrate that these algorithms yield an efficient tradeoff between performance and robustness, outperforming traditional deep robust RL methods in time-constrained environments while preserving robustness in classical benchmarks. This study revisits the prevailing assumptions in robust RL and opens new avenues for developing more practical and realistic RL applications.

1 Introduction

Robust MDPs capture the problem of finding a control policy for a dynamical system whose transition kernel is only known to belong to a defined uncertainty set. The most common framework for analyzing and deriving algorithms for robust MDPs is that of sa-rectangularity [1, 2], where probability measures on outcome states are picked independently in different source states and actions (in formal notation, $P(\cdot|s, a)$ and $P(\cdot|s', a')$ are chosen independently of each other). This provides an appreciable decoupling of the worst-case transition kernel search across time steps and enables sound algorithms such as robust value iteration (RVI). But policies obtained for such sa-rectangular MDPs are by nature very conservative [3, 4], as they allow drastic changes in environment properties from one time step to the next, and the algorithms derived from RVI tend to yield very conservative policies even when applied to non-sa-rectangular robust MDP problems. In this paper, we depart from the rectangularity assumption and turn towards a family of robust MDPs whose transition kernels are parameterized by a vector ψ. This parameter vector couples together the outcome probabilities in different (s, a) pairs, hence breaking the independence assumption that is problematic, especially in large dimensions [3]. This enables accounting for the notion of transition model consistency across states and actions: outcome probabilities are not picked independently anymore but are rather set across the state and action spaces by drawing a parameter vector.
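To make the notion of a parametric transition kernel concrete, the toy sketch below shows how a single parameter vector ψ = (mass, rod length) fixes a pendulum's dynamics for every state-action pair at once, which is exactly the coupling across (s, a) pairs discussed above. The frictionless physics and all numerical values are illustrative choices, not taken from the paper.

```python
import numpy as np

# Toy illustration: one parameter vector psi = (mass, rod_length) determines the
# transition dynamics simultaneously for every state-action pair.

def pendulum_step(state, torque, psi, dt=0.05, g=9.81):
    theta, theta_dot = state
    mass, length = psi
    # Simple frictionless pendulum dynamics, Euler-integrated.
    theta_ddot = -(g / length) * np.sin(theta) + torque / (mass * length ** 2)
    theta_dot = theta_dot + dt * theta_ddot
    theta = theta + dt * theta_dot
    return np.array([theta, theta_dot])

s = np.array([0.1, 0.0])
for psi in [(1.0, 1.0), (1.2, 0.9)]:   # two points of the uncertainty set Psi
    print(psi, pendulum_step(s, torque=0.5, psi=psi))
```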
In turn, we examine algorithms for solving such parameter-based robust MDPs when the parameter is constrained to follow a bounded evolution through time steps. Our contributions are the following.

1. We introduce a formal definition of parametric robust MDPs and time-constrained robust MDPs, discuss their properties, and derive a generic algorithmic framework (Sec. 2).
2. We propose three algorithmic variants for solving time-constrained MDPs, named vanilla TC, Stacked-TC and Oracle-TC (Sec. 4), which use different levels of information in the state space and come with theoretical guarantees (Sec. 6).
3. These algorithms are extensively evaluated on MuJoCo [5] benchmarks, demonstrating that they lead to non-conservative and robust policies (Sec. 5).

2 Problem statement

(Robust) MDPs. A Markov Decision Process (MDP) [6] is a model of a discrete-time, sequential decision making task. At each time step, from a state $s_t \in S$ of the MDP, an action $a_t \in A$ is taken and the state changes to $s_{t+1}$ according to a stationary Markov transition kernel $p(s_{t+1}|s_t, a_t)$, while concurrently receiving a reward $r(s_t, a_t)$. $S$ and $A$ are measurable sets and we write $\Delta_S$ and $\Delta_A$ for the corresponding sets of probability distributions. A stationary policy $\pi(\cdot|s)$ is a mapping from states to distributions over actions, prescribing which action should be taken in $s$. The value function $v^\pi_p$ of policy $\pi$ maps state $s$ to the expected discounted sum of rewards $\mathbb{E}_{p,\pi}[\sum_t \gamma^t r_t]$ when applying $\pi$ from $s$ for an infinite number of steps. An optimal policy for an MDP is one whose value function is maximal in any state. In a Robust MDP (RMDP) [1, 2], the transition kernel $p$ is not set exactly and can be picked in an adversarial manner at each time step from an uncertainty set $P$. The pessimistic value function of a policy is then $v^\pi_P(s) = \min_{p \in P} v^\pi_p(s)$. An optimal robust policy is one that has the largest possible pessimistic value function $v^*_P$ in any state, hence yielding an adversarial $\max_\pi \min_p$ optimization problem. Robust Value Iteration (RVI) [1, 7] solves this problem by iteratively computing the one-step lookahead best pessimistic value:
$$v_{n+1}(s) = T_P v_n(s) := \max_{\pi(s) \in \Delta_A} \min_{p \in P} \mathbb{E}_{a \sim \pi(s)}\big[r(s, a) + \gamma\, \mathbb{E}_p[v_n(s')]\big].$$
The $T_P$ operator is called the robust Bellman operator, and the sequence of $v_n$ functions converges to the robust value function $v^*_P$ as long as the adversarial transition kernel belongs to the simplex $\Delta_S$.

Zero-sum Markov Games. Robust MDPs can be cast as zero-sum two-player Markov games [8, 9], where $B$ is the action set of the adversarial player. Writing $\bar{\pi} : S \times A \to B$ the policy of this adversary, the robust MDP problem turns into $\max_\pi \min_{\bar{\pi}} v^{\pi, \bar{\pi}}$, where $v^{\pi, \bar{\pi}}(s)$ is the expected sum of discounted rewards obtained when playing $\pi$ (agent actions) against $\bar{\pi}$ (transition models) at each time step from $s$. This enables introducing the robust value iteration sequence of functions
$$v_{n+1}(s) := T^* v_n(s) := \max_{\pi(s) \in \Delta_A} \min_{\bar{\pi}(s,a) \in \Delta_S} (T^{\pi, \bar{\pi}} v_n)(s), \quad \text{where } (T^{\pi, \bar{\pi}} v_n)(s) := \mathbb{E}_{a \sim \pi(s)}\big[r(s, a) + \gamma\, \mathbb{E}_{s' \sim \bar{\pi}(s,a)} v_n(s')\big]$$
is a zero-sum Markov game operator. These operators are also $\gamma$-contractions and converge to their respective fixed points $v^{\pi, \bar{\pi}}$ and $v^* = v^*_P$ [9]. This formulation will be useful to derive a practical algorithm in Section 4. Often, this convergence is analyzed under the assumption of sa-rectangularity, stating that the uncertainty set $P$ is a set product of independent subsets of $\Delta_S$ in each $(s, a)$ pair. Quoting [1], rectangularity is "a sort of independence assumption" and is a minimal requirement for most theoretical results to hold.
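As a concrete illustration of the robust value iteration update above, here is a minimal tabular sketch in Python, assuming a finite uncertainty set represented as a list of candidate transition kernels; all sizes and values are toy examples. Note that the inner minimum is taken independently for each (s, a) pair, i.e. this is the sa-rectangular relaxation that the paper departs from.

```python
import numpy as np

# Minimal tabular sketch of robust value iteration (RVI). The uncertainty set is
# a finite collection of candidate kernels {P_k}; the inner min picks the worst
# kernel independently per (s, a), which is the sa-rectangular backup of Sec. 2.

def robust_value_iteration(kernels, reward, gamma=0.9, n_iters=500):
    """kernels: (K, S, A, S) candidate transition kernels; reward: (S, A)."""
    v = np.zeros(kernels.shape[1])
    for _ in range(n_iters):
        # Q_k(s, a) = r(s, a) + gamma * sum_s' P_k(s'|s, a) v(s') for each kernel k
        q_per_kernel = reward[None] + gamma * kernels @ v   # (K, S, A)
        q_worst = q_per_kernel.min(axis=0)                  # pessimistic Q, per (s, a)
        v = q_worst.max(axis=1)                             # agent maximizes over actions
    return v, q_worst.argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, S, A = 3, 4, 2
    kernels = rng.random((K, S, A, S))
    kernels /= kernels.sum(axis=-1, keepdims=True)          # normalize to the simplex
    reward = rng.random((S, A))
    v_robust, policy = robust_value_iteration(kernels, reward)
    print("robust value function:", np.round(v_robust, 3))
    print("greedy robust policy:", policy)
```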
Within robust value iteration, rectangularity enables picking $\bar{\pi}(s_t, a_t)$ completely independently of $\bar{\pi}(s_{t-1}, a_{t-1})$. To fix ideas, consider the robust MDP of a pendulum described by its mass and rod length. Varying this mass and rod length spans the uncertainty set of transition models. The rectangularity assumption implies that $\bar{\pi}(s_t, a_t)$ can pick a measure in $\Delta_S$ corresponding to a mass and a length that are completely independent from the ones picked at the previous time step. While this might be a good representation in some cases, in general it yields policies that are very conservative, as they optimize for adversarial configurations which might not occur in practice. We first step away from the rectangularity assumption and define a parametric robust MDP as an RMDP whose transition kernels are spanned by varying a parameter vector ψ (typically the mass and rod length in the previous example). Choosing such a vector couples together the probability measures on successor states from two distinct (s, a) and (s', a') pairs. The main current robust deep RL algorithms actually optimize policies for such parametric robust MDPs, but still allow the parameter value at each time step to be picked independently of the previous time step.

Parametric MDPs. A parametric RMDP is given by the tuple $(S, A, \Psi, p_\psi, r)$ where the transition kernel $p_\psi(s, a) \in \Delta_S$ is parameterized by ψ, and $\Psi$ is the set of values ψ can take, equipped with an appropriate metric. This yields the robust value iteration update
$$v_{n+1}(s) = \max_{\pi(s) \in \Delta_A} \min_{\psi \in \Psi} (T^\pi_\psi v_n)(s) := \max_{\pi(s) \in \Delta_A} \min_{\psi \in \Psi} \mathbb{E}_{a \sim \pi(s)}\big[r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_\psi(s,a)} v_n(s')\big].$$
A parametric RMDP remains a Markov game, and the Bellman operator remains a contraction mapping as long as $p_\psi$ can only reach elements of the simplex $\Delta_S$; the adversary's action set is now the set of parameters instead of a (possibly sa-rectangular) set of transition kernels.

Time-constrained RMDPs (TC-RMDPs). We introduce TC-RMDPs as the family of parametric RMDPs whose parameter evolution is constrained to be Lipschitz with respect to time. More formally, a TC-RMDP is given by the tuple $(S, A, \Psi, p_\psi, r, L)$, where $\|\psi_{t+1} - \psi_t\| \le L$, i.e. the parameter change is bounded through time. In the previous pendulum example, this might represent the wear of the rod, which might lose mass or stretch in length. Similarly, and for a larger-scale illustration, TC-RMDPs enable representing the possible evolutions of traffic conditions in a path planning problem through a busy town. Starting from an initial parameter value $\psi_{-1}$, the pessimistic value function of a policy π is non-stationary, as $\psi_0$ is constrained to lie at most $L$ away from $\psi_{-1}$, $\psi_1$ from $\psi_0$, and so on. Generally, this yields non-stationary value functions, as the uncertainty set at each time step depends on the previous uncertainty parameter. To regain stationarity without changing the TC-RMDP definition, we first change the definition of the adversary's action set. The adversary picks its actions in the constant set $B = \mathcal{B}(0_\Psi, L)$, the ball of radius $L$ centered at the null element of $\Psi$. In turn, the state of the Markov game becomes the pair $(s, \psi)$ and the Markov game itself is given by the tuple $((S \times \Psi), A, B, p_\psi, r)$, where the Lipschitz constant $L$ is included in $B$. Thus, given an action $b_t \in B$ and a previous parameter value $\psi_{t-1}$, the parameter value at time $t$ is $\psi_t = \psi_{t-1} + b_t$.
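The following is a minimal sketch of the time-constrained parameter dynamics just described: the adversary's proposal is projected onto the ball $\mathcal{B}(0, L)$ before being applied, so that consecutive parameters differ by at most $L$ and remain inside the (normalized) uncertainty set. The simulator calls in the comments and all names are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

# Sketch of the TC-RMDP parameter process: psi_t = psi_{t-1} + b_t with b_t in B(0, L).

class TimeConstrainedParams:
    def __init__(self, psi_init, radius_L, psi_low, psi_high):
        self.psi = np.asarray(psi_init, dtype=float)
        self.L = radius_L
        self.low, self.high = np.asarray(psi_low), np.asarray(psi_high)

    def step(self, b):
        b = np.asarray(b, dtype=float)
        norm = np.linalg.norm(b)
        if norm > self.L:                          # project the adversary action onto B(0, L)
            b = b * (self.L / norm)
        self.psi = np.clip(self.psi + b, self.low, self.high)   # stay inside Psi
        return self.psi

# Usage: couple the parameter process with any parametric simulator.
params = TimeConstrainedParams(psi_init=[0.5, 0.5, 0.5], radius_L=0.001,
                               psi_low=[0.0, 0.0, 0.0], psi_high=[1.0, 1.0, 1.0])
for t in range(3):
    b_t = np.random.uniform(-1, 1, size=3)         # placeholder for the adversary policy
    psi_t = params.step(b_t)
    # env.set_physics(psi_t); s, r, done = env.step(a)   # hypothetical simulator calls
    print(t, psi_t)
```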
Then, we define the pessimistic value function of a policy as a function of both the state $s$ and parameter $\psi$:
$$v^\pi_B(s, \psi) := \min_{(b_t)_{t \in \mathbb{N}},\, b_t \in B} \mathbb{E}\Big[\textstyle\sum_t \gamma^t r_t \,\Big|\, \psi_{-1} = \psi,\, s_0 = s,\, b_t \in B,\, \psi_t = \psi_{t-1} + b_t,\, a_t \sim \pi,\, s_t \sim p_{\psi_t}\Big], \qquad v^*_B(s, \psi) = \max_{\pi(s,\psi) \in \Delta_A} v^\pi_B(s, \psi).$$
In turn, an optimal robust policy is a function of $s$ and $\psi$, and the TC robust Bellman operators are
$$v_{n+1}(s, \psi) := T^*_B v_n(s, \psi) := \max_{\pi(s,\psi) \in \Delta_A} T^\pi_B v_n(s, \psi) = \max_{\pi(s,\psi) \in \Delta_A} \min_{b \in B} \mathbb{E}_{a \sim \pi(s,\psi)}\big[r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_{\psi+b}(s,a)} v_n(s', \psi + b)\big].$$
This iteration scheme converges to a fixed point according to Th. 2.1.

Theorem 2.1. The time-constrained (TC) Bellman operators $T^\pi_B$ and $T^*_B$ are contraction mappings. Thus the sequences $v_{n+1} = T^\pi_B v_n$ and $v_{n+1} = T^*_B v_n$ converge to their respective fixed points $v^\pi_B$ and $v^*_B$.

The proof of Th. 2.1 can be found in Appendix B. We refer to this formulation as algorithm Oracle-TC (see Section 4 for implementation details), since an oracle makes the current parameter ψ visible to the agent. Therefore, it is possible to derive optimal policies for TC-RMDPs by iterated application of this TC Bellman operator. These policies have the form π(s, ψ). In the remainder of this paper, we extend state-of-the-art robust deep RL algorithms to the TC-RMDP framework. In particular, we compare their performance and robustness properties with respect to classical robust MDP formulations, and we discuss their relation with the π(s) robust policies of classical robust MDPs. If the agent is unable to observe the state variable ψ, it is not possible to guarantee the existence of a stationary optimal policy of the form π(s). Similarly, there is no guarantee of convergence of the value functions to a fixed point. Nonetheless, this scenario, in which access to the ψ parameter is not available, is more realistic in practice. It turns the two-player Markov game into a partially observable Markov game, in which one can still apply the TC Bellman operator, but without these guarantees of convergence. We call vanilla TC the repeated application of the TC Bellman operator in this partially observable case. Vanilla TC will be tested in practice, and some theoretical properties of the objective function will be derived using Lipschitz properties (Sec. 6).

3 Related works

Since our method is a non-rectangular deep robust RL algorithm (possibly non-stationary for Stacked-TC and vanilla TC), we discuss the following related work.

Non-stationary MDPs. Non-stationarity was first studied in the bandits setting [10]. Then, for episodic non-stationary MDPs, [11, 12, 13] explored and provided regret bounds for algorithms that use oracle access to the current reward and transition functions. More recently, [14, 15] relaxed this oracle access by performing a count-based estimation of the reward and transition functions based on the recent history of interactions. Finally, for tabular MDPs, past data from a non-stationary MDP can be used to construct a full Bayesian model [16] or a maximum likelihood model [17] of the transition dynamics. We focus on settings that are not restricted to tabular representations.

Non-rectangular RMDPs. While rectangularity is very conservative in practice, it can be demonstrated that, in an asymptotic sense, non-rectangular ellipsoidal uncertainty sets around the maximum likelihood estimator of the transition kernel constitute the smallest possible confidence sets for the ground-truth transition kernel, as implied by classical Cramér-Rao bounds. This is in accordance with the findings presented in Section 5 and Appendix A of [7].
More recently, [3] extends the rectangular assumptions using a factored uncertainty model, where all transition probabilities depend on a small number of underlying factors $w_1, \ldots, w_r \in \mathbb{R}^S$, such that the transition probability $P_{sa}$ for every $(s, a)$ is a linear (convex) combination of these $r$ factors. Finally, [4] uses policy gradient algorithms for non-rectangular robust MDPs. While this work presents nice theoretical convergence guarantees, it provides no practical deep RL algorithm for learning optimal robust policies.

Deep Robust RL Methods. Many deep robust RL algorithms exist, such as M2TD3 [18], M3DDPG [19], or RARL [20], which are all based on the two-player zero-sum game presented in Section 2. We compare our method against these algorithms, except [19], which is generally outperformed by [18]. We also compare our algorithm to domain randomization (DR) [21], which learns a value function $V(s) = \max_\pi \mathbb{E}_{p \sim U(P)} V^\pi_p(s)$ that maximizes the expected return on average across a fixed (generally uniform) distribution on $P$. As such, DR approaches do not optimize worst-case performance but still perform well on average. Nonetheless, DR has been used convincingly in applications [22, 23]. Finally, the zero-sum game formulation has led to the introduction of action robustness [9], a specific case of rectangular MDPs in which the adversary shares the same action space as the agent and interferes with the agent's actions. Several strategies based on this idea have been proposed. One approach, the Game-theoretic Response Approach for Adversarial Defense (GRAD) [24], builds on the Probabilistic Action Robust MDP (PR-MDP) [9]. This method introduces time-constrained perturbations in both the action and state spaces and employs a game-theoretic approach with a population of adversaries. In contrast to GRAD, where temporal disturbances affect the transition kernel around a nominal kernel, our method is part of a broader setting in which the transition kernel is included in a larger uncertainty set. Robustness via Adversary Populations (RAP) [25] introduces a population of adversaries. This approach ensures that the agent develops robustness against a wide range of potential perturbations, rather than just a single one, which helps prevent convergence to suboptimal stationary points. Similarly, state-adversarial MDPs [26, 27, 28, 24] address adversarial attacks on state observations, effectively creating a partially observable MDP. Finally, under rectangularity assumptions, [29, 30] use Wasserstein and $\chi^2$ balls, respectively, for the uncertainty set in robust RL.

4 Time-constrained robust MDP algorithms

The TC-RMDP framework addresses the limitations of traditional robust reinforcement learning by considering multifactorial, correlated, and time-dependent disturbances. Traditional robust reinforcement learning often relies on rectangularity assumptions, which are rarely met in real-world scenarios, leading to overly conservative policies. The TC-RMDP framework provides a more accurate reflection of real-world dynamics, moving beyond the conventional rectangularity paradigm.

Figure 1: TC-RMDP training involves a temporally-constrained adversary aiming to maximize the effect of temporally-coupled perturbations. Conversely, the agent aims to optimize its performance against this time-constrained adversary. In orange, the oracle observation; in blue, the stacked observation.
We cast the TC-RMDP problem as a two-player zero-sum game, where the agent interacts with the environment and the adversary (nature) changes the MDP parameters ψ. Our approach is generic and can be derived within any robust value iteration scheme performing updates of the form $\max_{\pi(s) \in \Delta_A} \min_{\psi \in \Psi} \mathbb{E}_{a \sim \pi(s)}[r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_\psi(s,a)} v_n(s')]$, by modifying the adversary's action space and potentially the agent's state space to obtain updates of the form $\max_{\pi(s,\psi) \in \Delta_A} \min_{b \in B} \mathbb{E}_{a \sim \pi(s,\psi)}[r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_{\psi+b}(s,a)} v_n(s')]$. In Section 5, we will introduce time constraints within two specific robust value iteration algorithms, namely RARL [20] and M2TD3 [18], by simply limiting the search space for the worst-case ψ at each step. This specific implementation extends the original actor-critic algorithms. For the sake of conciseness, we refer the reader to Appendix E.1 for the loss functions and algorithmic details. Three variations of the algorithm are provided (illustrated in Figure 1); all fall within the training loop of Algorithm 1.

Algorithm 1 Time-constrained robust training
Input: time-constrained MDP $(S, A, \Psi, p_\psi, r, L)$, agent $\pi$, adversary $\bar{\pi}$
for each interaction time step $t$ do
    $a_t \sim \pi_t(s_t, \psi_t)$ // sample an action with Oracle-TC
        or $a_t \sim \pi_t(s_t, a_{t-1}, s_{t-1})$ // sample an action with Stacked-TC
        or $a_t \sim \pi_t(s_t)$ // sample an action with vanilla TC
    $\psi_{t+1} \sim \bar{\pi}_t(s_t, a_t, \psi_t)$ // sample the worst time-constrained parameter
    $s_{t+1} \sim p_{\psi_{t+1}}(s_t, a_t)$ // sample a transition
    $B \leftarrow B \cup \{(s_t, a_t, r(s_t, a_t), \psi_t, \psi_{t+1}, s_{t+1})\}$ // add the transition to the replay buffer
    $\{s_i, a_i, r(s_i, a_i), \psi_i, \psi_{i+1}, s_{i+1}\}_{i \in [1, N]} \sim B$ // sample a mini-batch of transitions
    $\pi_{t+1} \leftarrow \mathrm{UpdatePolicy}(\pi_t)$ // update the agent
    $\bar{\pi}_{t+1} \leftarrow \mathrm{UpdatePolicy}(\bar{\pi}_t)$ // update the adversary

Oracle-TC. As discussed in Section 2, the Oracle-TC version includes the MDP state and parameter value as input, $\pi : S \times \Psi \to A$. This method assumes that the agent has access to the true parameters of the environment, allowing it to make the most informed decisions and possibly reach the true robust value function. However, these parameters ψ are often not observable in practical scenarios, so this method is not always feasible.

Stacked-TC. Since ψ might not be observable but may be approximately identified from the last transitions, the Stacked-TC policy uses the previous state and action as additional inputs in an attempt to replace ψ, $\pi : S \times A \times S \to A$. This approach leverages the information contained in the transitions, even though it might be insufficient for a perfect estimate of ψ. It aims to retain (approximately) the convergence properties of the Oracle-TC algorithm.

Vanilla TC. Finally, the vanilla TC version takes only the state as input, $\pi : S \to A$, similar to standard robust MDP policies. This method does not attempt to infer the environmental parameters or the transition dynamics explicitly. Instead, it relies on the current state information alone to guide the agent's actions. While this version is the most straightforward and computationally efficient, it may not perform as robustly as the Oracle-TC or Stacked-TC versions in environments with significant temporal disturbances, since it attempts to solve a partially observable Markov game, for which a stationary optimal policy based only on the observation may not exist. Despite this, it remains a viable option in scenarios where computational simplicity and quick decision-making are prioritized.
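The only difference between the three variants of Algorithm 1 is the input fed to the agent's policy. A small illustrative sketch of how these observations can be assembled is given below; array shapes and names are arbitrary and not taken from the authors' code.

```python
import numpy as np

# Assembling the three policy inputs of Algorithm 1 from raw interaction data.

def oracle_tc_obs(s_t, psi_t):
    # Oracle-TC: pi(s, psi) - the true environment parameters are observed.
    return np.concatenate([s_t, psi_t])

def stacked_tc_obs(s_t, a_prev, s_prev):
    # Stacked-TC: pi(s_t, a_{t-1}, s_{t-1}) - previous transition as a proxy for psi.
    return np.concatenate([s_t, a_prev, s_prev])

def vanilla_tc_obs(s_t):
    # vanilla TC: pi(s) - no extra information, partially observable game.
    return np.asarray(s_t)

s_t, s_prev = np.random.randn(11), np.random.randn(11)
a_prev, psi_t = np.random.randn(3), np.array([0.5, 0.5, 0.5])
print(oracle_tc_obs(s_t, psi_t).shape,
      stacked_tc_obs(s_t, a_prev, s_prev).shape,
      vanilla_tc_obs(s_t).shape)
```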
5 Experiments

Experimental settings. This section evaluates the performance of the robust time-constrained algorithms under severe time constraints and in the static setting. Experimental validation was conducted in continuous control scenarios using the MuJoCo simulation environments [5]. The approach was categorized into three variants: Oracle-TC, where the agent accesses the environmental parameters, $\pi(s_t, \psi)$; Stacked-TC, where the agent takes $\pi(s_t, s_{t-1}, a_{t-1})$ as input; and vanilla TC, which does not receive any additional input, $\pi(s_t)$. Each variant of the time-constrained algorithm was applied to RARL [20] and M2TD3 [18], yielding TC-RARL and TC-M2TD3, respectively. The algorithms were tested against two state-of-the-art robust reinforcement learning algorithms, M2TD3 and RARL. Additionally, the Oracle versions of M2TD3 and RARL, where the agent's policy includes ψ in the input ($\pi : S \times \Psi \to A$), were evaluated for a more comprehensive assessment. Comparisons were also made with Domain Randomization (DR) [21] and vanilla TD3 [31] to ensure a thorough analysis. A 3D uncertainty set $P$, normalized to $[0, 1]^3$, is defined in each environment. Appendix G provides detailed descriptions of the uncertainty parameters. Performance metrics were gathered after five million steps to ensure a fair comparison. All baselines were built on TD3, and a consistent architecture was maintained across all TD3 variants. The results presented below were obtained by averaging over ten distinct random seeds. Appendices E.4, E.3, E.2, and H.2 discuss further details on hyperparameters, network architectures, and implementation choices, including training curves for our methods and baseline comparisons. In Tables 1, 2, and 3 below, the best performances are shown in bold. Oracle methods, with access to optimal information, are shown in black. Items in bold and green represent the best performances with limited information on ψ, making them more easily usable in many scenarios. When there is only one element in bold and green, the best overall method is a non-oracle method.

| Method | Ant | Half Cheetah | Hopper | Humanoid | Walker | Agg. |
|---|---|---|---|---|---|---|
| Oracle M2TD3 | 1.11 ± 0.07 | 0.95 ± 0.1 | 1.51 ± 0.84 | 2.07 ± 0.19 | 1.31 ± 0.36 | 1.39 ± 0.31 |
| Oracle RARL | 0.72 ± 0.18 | 0.71 ± 0.05 | 1.3 ± 0.28 | 2.8 ± 1.62 | 0.19 ± 0.2 | 0.86 ± 0.47 |
| Oracle-TC-M2TD3 | 1.61 ± 0.32 | 2.76 ± 0.16 | 7.79 ± 1.0 | 1.69 ± 2.14 | 1.49 ± 0.41 | 3.07 ± 0.81 |
| Oracle-TC-RARL | 1.66 ± 0.32 | 2.63 ± 0.12 | 6.86 ± 1.46 | 0.19 ± 1.68 | 1.34 ± 0.11 | 2.54 ± 0.74 |
| Stacked-TC-M2TD3 | 1.33 ± 0.21 | 2.4 ± 0.19 | 6.51 ± 0.59 | 1.42 ± 1.44 | 1.69 ± 0.33 | 2.1 ± 0.55 |
| Stacked-TC-RARL | 1.48 ± 0.22 | 1.76 ± 0.08 | 3.28 ± 0.27 | 1.39 ± 0.57 | 1.01 ± 0.21 | 1.78 ± 0.27 |
| TC-M2TD3 | 1.52 ± 0.2 | 2.42 ± 0.1 | 5.16 ± 0.2 | 4.02 ± 1.23 | 1.38 ± 0.25 | 2.9 ± 0.4 |
| TC-RARL | 1.57 ± 0.26 | 1.54 ± 0.15 | 2.04 ± 0.49 | 1.25 ± 1.91 | 0.89 ± 0.2 | 1.46 ± 0.6 |
| TD3 | 0.0 ± 0.19 | 0.0 ± 0.27 | 0.0 ± 1.27 | 0.0 ± 1.18 | 0.0 ± 0.23 | 0.0 ± 0.63 |
| DR | 1.58 ± 0.2 | 1.59 ± 0.12 | 2.28 ± 0.42 | 0.87 ± 1.79 | 1.03 ± 0.19 | 1.47 ± 0.54 |
| M2TD3 | 1.0 ± 0.19 | 1.0 ± 0.14 | 1.0 ± 0.96 | 1.0 ± 1.31 | 1.0 ± 0.31 | 1.0 ± 0.58 |
| RARL | 0.63 ± 0.2 | 0.61 ± 0.18 | 1.5 ± 0.33 | 0.8 ± 0.88 | 0.27 ± 0.25 | 0.08 ± 0.37 |

Table 1: Avg. of normalized time-coupled worst-case performance over 10 seeds for each method

Performance of TC-RMDPs in the worst-case time-constrained setting. Table 1 reports performance under the worst-case time-constrained perturbation. To compute the worst-case time-constrained perturbation for each trained agent $\pi^*$, we trained a time-constrained adversary with the TD3 algorithm, $\bar{\pi}^* \in \arg\min_{\bar{\pi}} \mathbb{E}_{a \sim \pi^*(s),\, b \sim \bar{\pi}(s, a, \psi)}[r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_{\psi+b}(s,a)} v_n(s')]$, within a perturbation radius of L = 0.001, for a total of 5 million steps. The sum of episode rewards was averaged over 10 episodes. A heuristic to choose the perturbation radius is explained in Appendix K.
To compare metrics across different environments, each method's score $v$ was standardized relative to the reference score of TD3. TD3 was trained on the environment using the default transition function parameters, with its score denoted $v_{TD3}$. The M2TD3 score, $v_{M2TD3}$, was used as the comparison target. The formula applied was $(v - v_{TD3}) / |v_{M2TD3} - v_{TD3}|$. This positions $v_{TD3}$ as the minimal baseline and $v_{M2TD3}$ as the target score. This standardization provides a metric that quantifies the improvement of each method over TD3 relative to the improvement of M2TD3 over TD3.

In each evaluation environment, agents trained within the time-constrained framework (indicated by TC in the method name) demonstrated significantly superior performance compared to those trained using alternative robust reinforcement learning approaches, including M2TD3 and RARL. Furthermore, they outperformed those trained through domain randomization (DR). Notably, even without directly conditioning the policy on ψ, the time-constrained policies excelled against all baselines, achieving up to a 2.9-fold improvement. The non-normalized scores are reported in Appendix H. Additionally, when policies were directly conditioned on ψ and trained within the classical robust reinforcement learning framework, they tended to be overly conservative in the time-constrained setting. This is depicted in Table 1 by comparing the performances of Oracle RARL, Oracle M2TD3, Oracle TC-RARL, and Oracle TC-M2TD3. All of these policies observe ψ; the only difference is that Oracle RARL and Oracle M2TD3 were trained in the classical robust reinforcement learning framework, while Oracle TC-RARL and Oracle TC-M2TD3 were trained in the time-constrained framework. The performance differences under worst-case time-coupled perturbation are as follows: for Oracle RARL (resp. Oracle M2TD3) vs. Oracle TC-RARL (resp. Oracle TC-M2TD3), the values are 0.86 (1.39) vs. 2.54 (3.07). This observation highlights the need for a balance between robust training and flexibility in dynamic conditions.

A natural question arises regarding the worst-case time-constrained perturbation: was the adversary in the loop adequately trained, or might its suboptimal performance lead to overestimating the trained agent's reward against the worst-case perturbation? The adversary's performance was monitored during its training against all fixed trained agents. The results in Appendix F show that our adversary converged.

Robust time-constrained training under various fixed adversaries. The method was evaluated against various fixed adversaries, focusing on the random fixed adversary shown in Figure 2. This evaluation shows that robustly trained agents can handle dynamic and unpredictable conditions. The random fixed adversary simulates stochastic changes by selecting a parameter $\psi_t$ at each timestep within a radius of L = 0.1. This radius is 100 times larger than the one used in our training methods. At the start of each episode, $\psi_0$ is uniformly sampled from the uncertainty set, $\psi_0 \sim U(P)$. This tests the agents' adaptability to unexpected changes. Figures 2a through 2e show our agents' performance. Agents trained with our robust framework consistently outperformed those trained with standard methods. The policies were also assessed against five other fixed adversaries, including cosine, exponential, linear, and logarithmic schedules. Detailed results are provided in Appendix H.1.
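For concreteness, here is one possible implementation of the random fixed adversary described above: $\psi_0$ is drawn uniformly in the normalized uncertainty set, then ψ moves by a random increment of norm at most L at every step. Only the radius and the initialization follow the text; the exact sampling scheme is an assumption for illustration.

```python
import numpy as np

# Illustrative random fixed adversary: psi_0 ~ U(P) and ||psi_t - psi_{t-1}|| <= L.

def random_fixed_adversary(horizon=1000, dim=3, L=0.1, seed=0):
    rng = np.random.default_rng(seed)
    psi = rng.uniform(0.0, 1.0, size=dim)              # psi_0 ~ U(P), P normalized to [0, 1]^3
    trajectory = [psi.copy()]
    for _ in range(horizon - 1):
        step = rng.normal(size=dim)
        step *= rng.uniform(0.0, L) / np.linalg.norm(step)   # random direction, norm <= L
        psi = np.clip(psi + step, 0.0, 1.0)            # stay inside the normalized set
        trajectory.append(psi.copy())
    return np.array(trajectory)

traj = random_fixed_adversary()
print(traj.shape, traj.min(), traj.max())
```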
Figure 2: Evaluation against a random fixed adversary with a radius L = 0.1 (panels 2a-2e, one per environment).

Performance of robust time-constrained MDPs in the static setting. In static environments, the robust time-constrained algorithms were evaluated for worst-case and average performance metrics, shown in Tables 2 and 3. A fixed uncertainty set $P$ was used, dividing each dimension of $\Psi$ into ten segments, creating a grid of 1000 points ($10^3$). Each agent ran five episodes at each grid point, and the rewards were averaged. The scores were normalized as described for the time-constrained adversary analysis in Table 1, i.e. adjusted relative to the baselines $v_{TD3}$ and $v_{M2TD3}$. The raw data are provided in Tables 9 and 10 in the appendix. The normalized results reveal distinct trends among agent configurations within the TC-RMDP framework. The Oracle TC-M2TD3 variant achieved an average score of 3.12 (Table 3), while Stacked TC-M2TD3 scored 2.23, indicating its resilience. Furthermore, in the worst-case scenario, the TC-RARL and Stacked TC-RARL variants demonstrated adaptability, with TC-RARL scoring 0.92 and TC-M2TD3 scoring 1.02 (Table 2). This performance highlights their reliability in challenging static environments.

| Method | Ant | Half Cheetah | Hopper | Humanoid | Walker | Agg |
|---|---|---|---|---|---|---|
| Oracle M2TD3 | 1.02 ± 0.19 | 0.34 ± 0.23 | 0.97 ± 0.55 | 3.9 ± 3.65 | 0.3 ± 0.45 | 1.31 ± 1.01 |
| Oracle RARL | 0.62 ± 0.32 | 0.1 ± 0.02 | 0.48 ± 0.19 | 2.59 ± 2.18 | 0.16 ± 0.21 | 0.25 ± 0.58 |
| Oracle-TC-M2TD3 | 0.1 ± 0.25 | 1.87 ± 0.1 | 0.49 ± 1.07 | 0.8 ± 3.05 | 0.28 ± 0.38 | 0.39 ± 0.97 |
| Oracle-TC-RARL | 0.59 ± 0.36 | 1.55 ± 0.35 | 0.4 ± 0.16 | 1.19 ± 1.24 | 0.56 ± 0.39 | 0.86 ± 0.5 |
| Stacked-TC-M2TD3 | 0.05 ± 0.09 | 1.56 ± 0.16 | 1.08 ± 0.89 | 0.83 ± 2.62 | 1.12 ± 0.5 | 0.58 ± 0.85 |
| Stacked-TC-RARL | 0.07 ± 0.13 | 0.76 ± 0.34 | 1.35 ± 0.93 | 1.75 ± 2.48 | 0.67 ± 0.32 | 0.92 ± 0.84 |
| TC-M2TD3 | 0.06 ± 0.08 | 1.49 ± 0.23 | 1.29 ± 0.29 | 1.21 ± 2.44 | 1.19 ± 0.34 | 1.02 ± 0.68 |
| TC-RARL | 0.14 ± 0.24 | 0.89 ± 0.3 | 1.5 ± 0.76 | 1.4 ± 4.57 | 0.67 ± 0.59 | 0.92 ± 1.29 |
| TD3 | 0.0 ± 0.34 | 0.0 ± 0.06 | 0.0 ± 0.21 | 0.0 ± 2.27 | 0.0 ± 0.1 | 0.0 ± 0.6 |
| DR | 0.06 ± 0.16 | 1.07 ± 0.36 | 0.86 ± 0.82 | 0.04 ± 4.1 | 0.57 ± 0.37 | 0.52 ± 1.16 |
| M2TD3 | 1.0 ± 0.27 | 1.0 ± 0.16 | 1.0 ± 0.65 | 1.0 ± 3.32 | 1.0 ± 0.63 | 1.0 ± 1.01 |
| RARL | 0.44 ± 0.3 | 0.13 ± 0.08 | 0.5 ± 0.22 | 0.44 ± 2.94 | 0.12 ± 0.09 | 0.33 ± 0.73 |

Table 2: Avg. of normalized static worst-case performance over 10 seeds for each method

| Method | Ant | Half Cheetah | Hopper | Humanoid | Walker | Agg |
|---|---|---|---|---|---|---|
| Oracle M2TD3 | 1.13 ± 0.08 | 1.56 ± 0.24 | 1.12 ± 0.46 | 1.96 ± 1.53 | 1.23 ± 0.3 | 1.4 ± 0.52 |
| Oracle RARL | 0.7 ± 0.22 | 1.4 ± 0.13 | 0.77 ± 0.24 | 2.6 ± 2.88 | 1.13 ± 0.84 | 1.04 ± 0.86 |
| Oracle-TC-M2TD3 | 1.73 ± 0.09 | 4.35 ± 0.26 | 5.54 ± 0.13 | 2.12 ± 1.4 | 1.84 ± 0.37 | 3.12 ± 0.45 |
| Oracle-TC-RARL | 1.78 ± 0.02 | 4.32 ± 0.21 | 5.08 ± 0.48 | 0.42 ± 2.9 | 1.68 ± 0.24 | 2.66 ± 0.77 |
| Stacked-TC-M2TD3 | 1.45 ± 0.38 | 3.78 ± 0.29 | 5.2 ± 0.29 | 1.38 ± 1.67 | 2.11 ± 0.52 | 2.23 ± 0.63 |
| Stacked-TC-RARL | 1.52 ± 0.11 | 2.29 ± 0.23 | 2.91 ± 0.67 | 1.14 ± 2.19 | 1.21 ± 0.46 | 1.81 ± 0.73 |
| TC-M2TD3 | 1.6 ± 0.06 | 3.71 ± 0.24 | 4.4 ± 0.6 | 3.28 ± 2.52 | 1.56 ± 0.23 | 2.91 ± 0.73 |
| TC-RARL | 1.67 ± 0.07 | 2.27 ± 0.22 | 1.79 ± 0.53 | 0.89 ± 2.19 | 1.01 ± 0.21 | 1.53 ± 0.64 |
| TD3 | 0.0 ± 0.49 | 0.0 ± 0.22 | 0.0 ± 0.83 | 0.0 ± 1.36 | 0.0 ± 0.51 | 0.0 ± 0.68 |
| DR | 1.65 ± 0.05 | 2.31 ± 0.27 | 2.08 ± 0.49 | 1.15 ± 2.47 | 1.22 ± 0.34 | 1.68 ± 0.72 |
| M2TD3 | 1.0 ± 0.11 | 1.0 ± 0.19 | 1.0 ± 0.55 | 1.0 ± 1.43 | 1.0 ± 0.65 | 1.0 ± 0.59 |
| RARL | 0.69 ± 0.13 | 1.3 ± 0.54 | 0.99 ± 0.11 | 0.47 ± 1.92 | 0.35 ± 0.83 | 0.3 ± 0.71 |

Table 3: Avg. of normalized static average case performance over 10 seeds for each method
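The static evaluation protocol described above (a 10 × 10 × 10 grid over the normalized uncertainty set, a few episodes per grid point, then worst-case and average aggregation) can be sketched as follows; `run_episode` is a hypothetical stand-in for rolling out a trained policy with the environment parameters frozen to ψ.

```python
import itertools
import numpy as np

# Sketch of the static evaluation grid: worst-case and average return over 10^3
# fixed parameter vectors, each averaged over a few episodes.

def static_grid_scores(run_episode, n_points=10, n_episodes=5):
    grid_1d = np.linspace(0.0, 1.0, n_points)
    scores = []
    for psi in itertools.product(grid_1d, repeat=3):          # 1000 fixed parameter vectors
        returns = [run_episode(np.array(psi)) for _ in range(n_episodes)]
        scores.append(np.mean(returns))
    scores = np.array(scores)
    return scores.min(), scores.mean()                        # static worst case and average

# Example with a synthetic score function standing in for the real rollouts.
worst, avg = static_grid_scores(lambda psi: -np.sum((psi - 0.5) ** 2) + np.random.normal(0, 0.01))
print(f"worst-case: {worst:.3f}, average: {avg:.3f}")
```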
6 Some theoretical properties of TC-RMDPs

6.1 On the optimal policy of TC-RMDPs

Following Lemma 3.3 of [1], it is known that in the rectangular case there exists an optimal adversary policy that is stationary, provided that the actor policy is stationary. The TC-RMDP definition enforces a limitation on the temporal variation of the transition kernel. Consequently, stationary adversarial policies automatically satisfy this constraint. In turn, this guarantees that (under the hypothesis of sa-rectangularity) there always exists a solution to the TC-RMDP that is also a solution to the original RMDP. In other words, optimizing policies for TC-RMDPs does not exclude optimal solutions to the underlying RMDP. This sheds an interesting light on the search for robust optimal policies, since TC-RMDPs shrink the search space of optimal adversarial policies. In practice, this is confirmed by the previous experimental results (Figure 2), where the optimal agent policy found by Oracle-TC, Stacked-TC, or vanilla TC actually outperforms the one found by M2TD3 or RARL in the non-time-constrained setting.

6.2 Some Lipschitz properties for non-stationary TC-RMDPs

In this subsection we slightly depart from the framework defined in Section 2 and study the smoothness of the robust objective for vanilla TC or Stacked-TC. Th. 2.1 is no longer applicable since ψ is not observed. However, we can still obtain smoothness of the objective, starting from Lipschitz conditions on the evolution of the parameter, which lead to smoothness of the reward and transition kernel, as stated in Definition 6.1.

Definition 6.1 (Reward/kernel Lipschitz TC-RMDPs [13]). We say that a parametric RMDP is time-constrained if the parameter change is bounded through time, i.e. $\|\psi_{t+1} - \psi_t\| \le L$. Moreover, we assume that this variation of the parameter implies a bounded variation of the reward and transition kernel: for all $s \in S$, $a \in A$,
$$\|P_t(\cdot \mid s, a) - P_{t+1}(\cdot \mid s, a)\|_1 \le L_P, \qquad |r_t(s, a) - \mathbb{E}[r_{t+1}(s, a)]| \le L_r.$$

From a theoretical point of view, a TC-RMDP can be seen as a sequence of stationary MDPs with time-indexed reward and transition kernels $r_t, P_t$ that evolve continuously. More formally, for $M_t = (S, A, \Psi, p_{\psi_t}, r_t, L = (L_P, L_r))$, we define the set of sequences of stationary MDPs with Lipschitz variation:
$$\mathcal{M}^L_t = \Big\{ \{M_{t'}\}_{t'=t_0}^{t} \;;\; \forall s \in S, \forall a \in A,\ \|P_{t'}(\cdot \mid s, a) - P_{t'+1}(\cdot \mid s, a)\|_1 \le L_P,\ |r_{t'}(s, a) - r_{t'+1}(s, a)| \le L_r \Big\}. \quad (1)$$
Defining $r^k_t$ as the random variable corresponding to the reward at rollout step $k$ in the stationary MDP indexed by $t$, the stationary rollout return at time $t$ is $G(\pi, M_t) = \sum_{k \ge 0} \gamma^k r^k_t$. Assuming that at a fixed $t$ the reward and transition kernel $r_t, P_t$ are fixed, the robust objective function is
$$J_R(\pi, t) := \min_{m = \{m_{t'}\}_{t'=t_0}^{t} \in \mathcal{M}^L_t} \mathbb{E}\left[G(\pi, m)\right].$$
This leads to the following guarantee for the vanilla TC and Stacked-TC algorithms.

Theorem 6.2. Assume a TC-RMDP with $L = (L_r, L_P)$ smoothness and $r_t \in [0, 1]$ for all $t \in \mathbb{N}$. Then, for all $t \in \mathbb{N}_+$ and $t_0 \in \mathbb{N}_+$,
$$|J_R(\pi, t_0) - J_R(\pi, t_0 + t)| \le L^* t, \qquad \text{with } L^* := \frac{\gamma}{(1-\gamma)^2} L_P + \frac{1}{1-\gamma} L_r.$$

This theorem states that a small variation of the kernel and reward function does not affect the robust objective too much. In other terms, despite the fact that the TC Bellman operator may not admit a fixed point and may yield a non-stationary sequence of value functions, variations of the expected return remain bounded. The proof of Th. 6.2 can be found in Appendix C.

7 Conclusion

This paper presents a novel framework for robust reinforcement learning, which addresses the limitations of traditional methods that rely on rectangularity assumptions. These assumptions often result in overly conservative policies, which are not suitable for real-world applications where environmental disturbances are multifactorial, correlated, and time-constrained. In order to overcome these challenges, we proposed a new formulation, the Time-Constrained Robust Markov Decision Process (TC-RMDP).
The TC-RMDP framework is capable of accurately capturing the dynamics of real-world environments, due to its consideration of the temporal continuity and correlation of disturbances. This approach resulted in the development of three algorithms: The three algorithms, Oracle-TC , Stacked-TC , vanilla TC which differ in the extent to which environmental information is incorporated into the decision-making process. A comprehensive evaluation of continuous control benchmarks using Mu Jo Co environments has demonstrated that the proposed TC-RMDP algorithms outperform traditional robust RL methods and domain randomization techniques. These algorithms achieved a superior balance between performance and robustness in both time-constrained and static settings. The results confirmed the effectiveness of the TC-RMDP framework in reducing the conservatism of policies while maintaining robustness. Moreover, we provided theoretical guaranties for Oracle-TC in Th. 2.1 and for Stacked-TC and vanilla TC in Th. 6.2. This study contributes to the field of robust reinforcement learning by introducing a time-constrained framework that more accurately reflects the dynamics observed in real-world settings. The proposed algorithms and theoretical contributions offer new avenues for the development of more effective and practical RL applications in environments with complex, time-constrained uncertainties. Acknowledgments The authors acknowledge the support of the DEEL and ENVIA projects, the funding of the AI Interdisciplinary Institute ANITI funding, through the French Investing for the Future PIA3 program under grant agreement ANR-19-PI3A-0004. This work benefited from computing resources from CALMIP under grant P21001. Pierre Clavier has been supported by a grant from Region Ile-de-France; DIM Math Innov and also by Fondation Mathématique Jacques Hadamard. [1] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257 280, 2005. [2] Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780 798, 2005. [3] Vineet Goyal and Julien Grand-Clement. Robust markov decision process: Beyond rectangularity. ar Xiv preprint ar Xiv:1811.00215, 2018. [4] Mengmeng Li, Tobias Sutter, and Daniel Kuhn. Policy gradient algorithms for robust mdps with non-rectangular uncertainty sets. ar Xiv preprint ar Xiv:2305.19004, 2023. [5] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026 5033. IEEE, 2012. [6] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. [7] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. [8] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157 163. Elsevier, 1994. [9] Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pages 6215 6224. PMLR, 2019. [10] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. ar Xiv preprint ar Xiv:0805.3415, 2008. [11] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Experts in a markov decision process. 
Advances in neural information processing systems, 17, 2004. [12] Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in markov decision processes with adversarially chosen transition probability distributions. Advances in neural information processing systems, 26, 2013. [13] Erwan Lecarpentier and Emmanuel Rachelson. Non-stationary markov decision processes, a worst-case approach using model-based reinforcement learning. Advances in neural information processing systems, 32, 2019. [14] Pratik Gajane, Ronald Ortner, and Peter Auer. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. ar Xiv preprint ar Xiv:1805.10066, 2018. [15] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Reinforcement learning under drift. ar Xiv preprint ar Xiv:1906.02922, 2019. [16] Nicholas K Jong and Peter Stone. Bayesian models of nonstationary Markov decision processes. Planning and Learning in A Priori Unknown or Dynamic Domains, page 132, 2005. [17] Melkior Ornik and Ufuk Topcu. Learning and planning for time-varying mdps using maximum likelihood estimation. ar Xiv preprint ar Xiv:1911.12976, 2019. [18] Takumi Tanabe, Rei Sato, Kazuto Fukuchi, Jun Sakuma, and Youhei Akimoto. Max-min off-policy actor-critic method focusing on worst-case robustness to model misspecification. In Advances in Neural Information Processing Systems, 2022. [19] Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4213 4220, 2019. [20] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817 2826. PMLR, 2017. [21] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23 30. IEEE, 2017. [22] Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull. Active domain randomization. In Conference on Robot Learning, pages 1162 1176. PMLR, 2020. [23] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob Mc Grew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik s cube with a robot hand. ar Xiv preprint ar Xiv:1910.07113, 2019. [24] Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Xiangyu Liu, Tuomas Sandholm, Furong Huang, and Stephen Marcus Mc Aleer. Game-theoretic robust reinforcement learning handles temporallycoupled perturbations. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023. [25] Eugene Vinitsky, Yuqing Du, Kanaad Parvate, Kathy Jang, Pieter Abbeel, and Alexandre Bayen. Robust reinforcement learning using adversarial populations. ar Xiv preprint ar Xiv:2008.01825, 2020. [26] Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21024 21037, 2020. [27] Huan Zhang, Hongge Chen, Duane S Boning, and Cho-Jui Hsieh. 
Robust reinforcement learning on state observations with learned optimal adversary. In International Conference on Learning Representations, 2021.
[28] Samuel Stanton, Rasool Fakoor, Jonas Mueller, Andrew Gordon Wilson, and Alex Smola. Robust reinforcement learning for shifting dynamics during deployment. In Workshop on Safe and Robust Control of Uncertain Systems at NeurIPS, 2021.
[29] Mohammed Amin Abdullah, Hang Ren, Haitham Bou Ammar, Vladimir Milenkovic, Rui Luo, Mingtian Zhang, and Jun Wang. Wasserstein robust reinforcement learning. arXiv preprint arXiv:1907.13196, 2019.
[30] Pierre Clavier, Stéphanie Allassonière, and Erwan Le Pennec. Robust reinforcement learning with distributional risk-averse formulation. arXiv preprint arXiv:2206.06841, 2022.
[31] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587-1596. PMLR, 2018.
[32] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G.M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1-18, 2022.

The appendix is structured as follows. Appendix B contains the fixed-point proof for the Oracle-TC algorithm (Th. 2.1). Appendix C contains the guarantees on the robust objective for the vanilla TC and Stacked-TC algorithms (Th. 6.2). Appendix F sanity-checks the adversary training used in the time-constrained evaluation. Appendix E provides all implementation details. Appendix H presents all raw results. Appendix I details the computing resources and training wall-clock time. Appendix J discusses the broader impact and limitations. Appendix K presents a heuristic for choosing the radius L.

B Proof of Theorem 2.1

Proof. The proof is similar to [1], using the fact that $p_{\psi+b}$ belongs to the simplex: we obtain contraction of the operator and convergence to a fixed point $v^*_B$. Note that rectangularity is not needed to converge to the fixed point. Recall the recursion
$$v_{n+1}(s, \psi) = \max_{\pi(s,\psi) \in \Delta_A} \min_{b \in B} T^\pi_b v_n(s, \psi) := \max_{\pi(s,\psi) \in \Delta_A} \min_{b \in B} \mathbb{E}_{a \sim \pi(s,\psi)}\big[r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_{\psi+b}} v_n(s', \psi')\big].$$
First, we prove that the TC robust operator $T^\pi_B$ is a contraction. Let $V_1, V_2 \in \mathbb{R}^n$. Fix $s \in S$, and assume that $T^\pi_B V_1(s, \psi) \ge T^\pi_B V_2(s, \psi)$. Then fix $\epsilon > 0$ and pick $\pi$ such that, given $s \in S$,
$$\inf_{b \in B} \mathbb{E}_{p_{\psi+b}}\big[r(s, \pi(s)) + \gamma V_1(s', \psi')\big] \ge T^\pi_B V_1(s, \psi) - \epsilon. \quad (3)$$
Next, pick a probability measure $p$ of the form $p = p_{\psi+b}$, $b \in B$, such that
$$\mathbb{E}_p\big[r(s, \pi(s)) + \gamma V_2(s', \psi')\big] \le \inf_{b \in B} \mathbb{E}_{p_{\psi+b}}\big[r(s, \pi(s)) + \gamma V_2(s', \psi')\big] + \epsilon. \quad (4)$$
This leads to
$$0 \le T^\pi_B V_1(s, \psi) - T^\pi_B V_2(s, \psi) \le \inf_{b \in B} \mathbb{E}_{p_{\psi+b}}\big[r(s, \pi(s)) + \gamma V_1(s', \psi')\big] + \epsilon - \inf_{b \in B} \mathbb{E}_{p_{\psi+b}}\big[r(s, \pi(s)) + \gamma V_2(s', \psi')\big] \quad (5\text{-}6)$$
$$\le \big(\mathbb{E}_p\big[r(s, \pi(s)) + \gamma V_1(s', \psi')\big] + \epsilon\big) - \big(\mathbb{E}_p\big[r(s, \pi(s)) + \gamma V_2(s', \psi')\big] - \epsilon\big) \quad (7\text{-}8)$$
$$= \gamma\, \mathbb{E}_p[V_1 - V_2] + 2\epsilon \;\le\; \gamma\, \mathbb{E}_p|V_1 - V_2| + 2\epsilon \;\le\; \gamma \|V_1 - V_2\|_\infty + 2\epsilon, \quad (9\text{-}11)$$
where the last inequality is Hölder's inequality between the $L_1$ and $L_\infty$ norms, using that $p$ is a probability measure in the simplex, so $\|p\|_1 = 1$. Doing the same in the case where $T^\pi_B V_1(s, \psi) \le T^\pi_B V_2(s, \psi)$, it holds that
$$\forall s \in S, \quad |T^\pi_B V_1(s, \psi) - T^\pi_B V_2(s, \psi)| \le \gamma \|V_1 - V_2\|_\infty + 2\epsilon, \quad (12)$$
i.e. $\|T^\pi_B V_1 - T^\pi_B V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty + 2\epsilon$. As $\epsilon$ can be chosen arbitrarily small, this establishes that the TC Bellman operator is a $\gamma$-contraction. Since $T^\pi_B$ is a contraction operator on a Banach space, the Banach fixed-point theorem implies that the operator equation $T^\pi_B V = V$ has a unique solution $V = v^\pi_B$. A similar proof can be done for the optimal operator $T^*_B$.
The only difference is the maximum operator, which is 1-Lipschitz, so $T^*_B$ is also a contraction. Then, once it is proved that the operators are $\gamma$-contractions, following [1] (Th. 5), the fixed points of this recursion are exactly
$$v^\pi_B(s, \psi) := \min_{(b_t)_{t \in \mathbb{N}},\, b_t \in B} \mathbb{E}\Big[\textstyle\sum_t \gamma^t r_t \,\Big|\, \psi_{-1} = \psi,\, s_0 = s,\, b_t \in B,\, \psi_t = \psi_{t-1} + b_t,\, a_t \sim \pi,\, s_t \sim p_{\psi_t}\Big], \quad (13)$$
$$v^*_B(s, \psi) = \max_{\pi(s,\psi) \in \Delta_A} v^\pi_B(s, \psi), \quad (14)$$
for the (optimal) TC Bellman operator.

C Guarantees for non-stationary robust MDPs

Recall that we represent a non-stationary robust MDP (NS-RMDP) as a stochastic sequence $M = \{M_t\}_{t=t_0}^{\infty}$ of stationary MDPs $M_t \in \mathcal{M}$, where $\mathcal{M}$ is the set of all stationary MDPs. Each $M_t$ is a tuple $(S, A, P_t, r_t, \gamma, \rho_0)$, where $S$ is the set of possible states, $A$ the set of actions, $\gamma$ the discount factor, and $\rho_0$ the start-state distribution. The reward distribution $r_t : S \times A \to \Delta(\mathbb{R})$ is the probability distribution of rewards, and the transition function $P_t : S \times A \to \Delta(S)$ is the probability distribution over next states; $\Delta$ denotes the simplex. For all $M_t \in \mathcal{M}$, we assume that the state space, action space, discount factor, and initial distribution remain fixed. A policy is represented as a function $\pi : S \to \Delta(A)$. In general, we use subscripts $t$ to denote the time evolution during an episode, and superscripts $k$ to denote the rollout step within the stationary MDP indexed by $t$, whose reward and kernel are held fixed. Thus $r^k_t$ is the random variable corresponding to the reward at rollout step $k$ in the stationary MDP $M_t$.

Definition C.1 (Lipschitz sequence of MDPs). We denote the sequences of kernels and reward functions by $P = \{P_t\}_{t=t_0}^{\infty}$ and $R = \{r_t\}_{t=t_0}^{\infty}$. A sequence of MDPs $m = \{m_t\}_{t=t_0}^{\infty} \in \mathcal{M}^L$ is $L = (L_r, L_P)$-Lipschitz if
$$\mathcal{M}^L_t = \Big\{ \{M_{t'}\}_{t'=t_0}^{t} \;;\; \exists (L_r, L_P) \in \mathbb{R}^2_+,\ \forall t' \in \mathbb{N},\ \forall s \in S, \forall a \in A,\ \|P_{t'}(\cdot \mid s, a) - P_{t'+1}(\cdot \mid s, a)\|_1 \le L_P,\ |r_{t'}(s, a) - r_{t'+1}(s, a)| \le L_r \Big\}.$$

Assuming that within a given time step $t$ the reward function is stationary, we can compute the average return as follows.

Definition C.2 (Non-robust objective function). With $G(\pi, M_t) = \sum_{k \ge 0} \gamma^k r^k_t$ the return in the stationary MDP $M_t$ (with reward function $r_t$),
$$J(\pi, t) = \mathbb{E}[G(\pi, M_t)] = (1-\gamma)^{-1} \sum_{s \in S} d^\pi(s, M_t) \sum_{a \in A} \pi(a \mid s)\, r_t(s, a), \quad (15)$$
with $d^\pi$ the state occupancy measure defined in (16).

Definition C.3 (Robust (optimal) return of NS-RMDPs). Let the return of $\pi$ for any $m_t \in \mathcal{M}_t$ be $G(\pi, M_t) := \sum_{k=0}^{\infty} \gamma^k r^k_t$ under transition kernel $P_t$, following $\pi$, with $r^k_t \in [0, 1]$ for all $k, t$. The robust non-stationary expected return (with variation of the kernel) of $\pi$ for episode $t$ is
$$J_R(\pi, t) := \min_{m = \{m_{t'}\}_{t'=t_0}^{t} \in \mathcal{M}^L_t} \mathbb{E}\left[G(\pi, m)\right].$$

D Proof of Theorem 6.2

$$\forall t \in \mathbb{N}_+, \forall t_0 \in \mathbb{N}_+, \quad |J_R(\pi, t_0) - J_R(\pi, t_0 + t)| \le L^* t, \qquad \text{with } L^* := \frac{\gamma}{(1-\gamma)^2} L_P + \frac{1}{1-\gamma} L_r.$$

Proof of Theorem 6.2. First, the difference can be upper bounded in the non-robust case. By definition, we can rewrite the occupancy measure and the non-robust objective function as
$$d^\pi(s, M_t) = (1-\gamma) \sum_{k=0}^{\infty} \gamma^k \Pr(S_k = s \mid \pi, M_t), \quad (16)$$
$$J(\pi, M_t) = (1-\gamma)^{-1} \sum_{s \in S} d^\pi(s, M_t) \sum_{a \in A} \pi(a \mid s)\, r_t(s, a). \quad (17)$$
First, we decompose the problem into sub-problems: for all $t \in \mathbb{N}_+$ and $t_0 \in \mathbb{N}_+$,
$$|J(\pi, t_0) - J(\pi, t_0 + t)| \le \sum_{t'=t_0}^{t_0+t-1} |J(\pi, M_{t'}) - J(\pi, M_{t'+1})|, \quad (18)$$
using the triangle inequality.
Looking at differences between two time steps: (1 γ)|J(π, Mt) J(π, Mt+1)| s S dπ (s, Mt) X a A π(a | s)rt(s, a) X s S dπ (s, Mt+1) X a A π(a | s)rt+1(s, a) a A π(a | s) (dπ (s, Mt) rt(s, a) dπ (s, Mt+1) rt+1(s, a)) a A π(a | s) (dπ (s, Mt) (rt+1(s, a) + (rt(s, a) Rt+1(s, a))) dπ (s, Mt+1) rt+1(s, a)) a A π(a | s) (dπ (s, Mt) dπ (s, Mt+1)) rt+1(s, a) a A π(a | s)dπ (s, Mt) (rt(s, a) rt+1(s, a)) a A π(a | s) |dπ (s, Mt) dπ (s, Mt+1)| |rt+1(s, a)| a A π(a | s)dπ (s, Mt) |rt(s, a) rt+1(s, a)| a A π(a | s) |dπ (s, Mt) dπ (s, Mt+1)| + LR X a A π(a | s)dπ (s, Mt) s S |dπ (s, Mt) dπ (s, Mt+1)| + Lr where (a) is triangular inequality, (b) is definition of of supremum of reward in the assumptions and reward bounded by 1. Then, let P π t R|S| |S| be the transition matrix ( s in rows and s in columns) resulting due to π and Pt, i.e., t, P π t (s , s) := Pr (St+1 = s | St = s, π, Mt), and let dπ ( , Mt) R|S| denote the vector of probabilities for each state, then Finally we can easily bound the difference of occupation measure as : s S |dπ (s, Mt) dπ (s, Mt+1)| (19) (d) γ(1 γ) 1 X P π t (s , s) P π t+1 (s , s) dπ (s, Mt) P π t (s , s) P π t+1 (s , s) dπ (s, Mt) (21) = γ(1 γ) 1 X a A π(a | s) (Pr (s | s, a, Mt) Pr (s | s, a, Mt+1)) dπ (s, Mt) (22) a A π(a | s) |Pr (s | s, a, Mt) Pr (s | s, a, Mt+1)| dπ (s, Mt) (23) = γ(1 γ) 1 X a A π(a | s)dπ (s, Mt) X s S |Pr (s | s, a, Mt) Pr (s | s, a, Mt+1)| (24) a A π(a | s)dπ (s, Mt) LP (25) = γLP (1 γ), (26) which gives regrouping all terms: |J(π, Mt) J(π, Mt+1)| Lr 1 γ + γLP (1 γ)2 . (27) where the stationary MDP Mt+1 can be chosen as the minimum over the previous MDPs at time step t such as |Pr (s | s, a, Mt) Pr (s | s, a, Mt+1)| Lp. Rewriting previous equation (27), it holds that h Eπ,P [G (π, m)] min m={m t}t+1 t =t Eπ,P [G (π, m)] i Lr 1 γ + γLP (1 γ)2 = L . (28) Now considering non robust objective : JR (π, t) JR (π, t + 1) (29) min m={m t}t t =t0 ML E [G (π, m)] min m={m t}t+1 t =t0 ML t+1 E [G (π, m)] min m={m t}t t =t0 ML t h E [G (π, m)] min m={m t}t t =t0 ML t min m={m t}t+1 t =t E [G (π, m)] i (31) max m={mt}t t=t0 ML t h E [G (π, m)] min m={m t}t+1 t =t E [G (π, m)] i (32) where first equality is the definition of the robust objective, second equality is decomposition of minimum across time steps and final inequality is simply a property of the min such as | min a min b| sup |a b|. Finally plugging 28 in (32), it holds that JR (π, t) JR (π, t + 1) (33) min m={m t}t t =t0 ML Eπ,P [G (π, m)] min m={m t}t+1 t =t0 ML Eπ,P [G (π, m)] Lr 1 γ + γLP (1 γ)2 . Combining t times the previous equation gives the result: t N+, t0 N+, |JR(π, t0) JR(π, t0 + t)| L t. 
with L := γ (1 γ)2 LP + 1 1 γ Lr E Implementation details E.1 Algorithm Algorithm 2 Time-constrained robust training Input: Time-constrained MDP: (S, A, Ψ, pψ, r, L), Agent π, Adversary π for each interaction time step t do at πt(st, ψt) // Sample an action with Oracle-TC at πt(st, at 1, st 1) // Sample an action with Stacked-TC at πt(st) // Sample an action with TC ψt+1 πϕ(st, at, ψt) // Sample the worst TC parameter st+1 pψt+1(st, at) // Sample a transition B B {(st, at, r (st, at) , ψt, ψt+1, st+1)} // Add transition to replay buffer {si, ai, r(si, ai), ψi, ψi+1, si+1}i [1,N] B // Sample a mini-batch of transitions θc θc α θc LQ(θc) // Critic update phase θa θa α θa Lπ(θa) // Actor update ϕc ϕc + α ϕc L Q(ϕc) // Adversary Critic update phase ϕa ϕa + α ϕa L π(ϕa) // Adversary update Note that in Time-constrained robust training Algorithm in section E.1, LQ and Lπ are as defined by [31] double critics and target network updates are omitted here for clarity In Table 4, for the stack algorithm, si is defined as si si si 1 ai 1 for Stacked-TC , and for the Oracle-TC version, si si ψi. Loss Function Equation LQθc (TC-RARL) E [Qθc(si, ai) r(si, ai) + γ minj=1,2 Qθc(si+1, π(si+1))] Lπ(θa) (TC-RARL) E [Qθc(si, πθa(si))] L π(θa) (TC-RARL) E Qθc(si, ai, π(si, ai), ψi) L Q(θc) (TC-RARL) E Qθc(si, ai) r(si, ai) + γ minj=1,2 Qθc(si+1, πθa(si+1), πθa(si+1, ai+1, ψi+1)) LQθc Shared (TC-M2TD3) E [Qθc(si, ai) r(si, ai) + γ minj=1,2 Qθc(si+1, πθa(si+1), πθa(si+1, ai+1, ψi+1))] Lπ(θa) (TC-M2TD3) E [Qθc(si, ai, πθa(si, ai), ψi)] L π(θa) (TC-M2TD3) E Qθc(si, ai, πθa(si, ai, ψi)) Table 4: Summary of Loss Functions for TD3 in TC-RARL and TC-M2TD3 E.2 Neural network architecture We employ a consistent neural network architecture for both the baseline and our proposed methods for the actor and the critic components. The architecture s design ensures uniformity and comparability across different models. The critic network is structured with three layers, as depicted in Figure 3a, the critic begins with an input layer that takes the state and action as inputs, which then passes through two fully connected linear layers of 256 units each. The final layer is a single linear unit that outputs a real-valued function, representing the estimated value of the state-action pair. The actor neural network, shown in Figure 3b, also utilizes a three-layer design. It begins with an input layer that accepts the state as input. This is followed by two linear layers, each consisting of 256 units. The output layer of the actor neural network has a dimensionality equal to the number of dimensions of the action space. (a) Critic neural network architecture (b) Actor neural network architecture Figure 3: Actor critic neural network architecture We utilized the official M2TD3 [18] implementation provided by the original authors, accessible via the Git Hub repository for M2TD3 and Oracle M2TD3. For the TC-M2TD3 or variants, we implemented the M2TD3 algorithm as specified. To simplify our approach, we omitted the implementation of the multiple ˆψ network and the system for resetting ˆψ. We replace with an adversary which π : S A Ψ Ψ which minimize Q(s, a, ψ). We adopted the TD3 implementation from the Clean RL library, as detailed in [32]. F Sanity check on the adversary training in the time-constrained evaluation A natural question arises regarding the worst time-constrained perturbation. 
Whether we adequately trained the adversary in the loop, or its suboptimal performance might lead to overestimating the trained agent reward against the worst-case time-constrained perturbation. We monitored the adversary s performance during its training against a fixed agent to address this. The attached figure shows the episodic reward (from the agent s perspective) during the adversary s training over 5 million timesteps, with a perturbation radius of L = 0.001. Each curve is an average of over 10 seeds. The plots show a rapid decline in reward during the initial stages of training, followed by quick stabilization. The episodic reward stabilizes early in the Ant (Figure 4a) environment, indicating Hyperparameter Default Value Policy Std Rate 0.1 Policy Noise Rate 0.2 Noise Clip Policy Rate 0.5 Noise Clip Omega Rate 0.5 Omega Std Rate 1.0 Min Omega Std Rate 0.1 Maximum Steps 5e6 Batch Size 100 Hatomega Number 5 Replay Size 1e6 Policy Hidden Size 256 Critic Hidden Size 256 Policy Learning Rate 3e-4 Critic Learning Rate 3e-4 Policy Frequency 2 Gamma 0.99 Polyak 5e-3 Hatomega Parameter Distance 0.1 Minimum Probability 5e-2 Hatomega Learning Rate (ho_lr) 3e-4 Optimizer Adam Table 5: Hyperparameters for the M2TD3 Agent Hyperparameter Default Value Maximum Steps 5e6 Buffer Size 1 106 Learning Rate 3 10 4 Gamma 0.99 Tau 0.005 Policy Noise 0.2 Exploration Noise 0.1 Learning Starts 2.5 104 Policy Frequency 2 Batch Size 256 Noise Clip 0.5 Action Min -1 Action Max 1 Optimizer Adam Table 6: Hyperparameters for the TD3 Agent quick convergence. Similarly, in the Half Cheetah (Figure 4b) environment, the reward shows a sharp initial decline and stabilizes, suggesting effective training. For Hopper (Figure 4c), the reward decreases and then levels off, reflecting adversary convergence. Although the reward is more variable in the Humanoid Standup (Figure 4d) environment, it ultimately reaches a steady state, confirming adequate training. Finally, in the Walker environment, the reward pattern demonstrates a quick drop followed by stabilization, indicating convergence. These observations confirm that the adversaries were not undertrained. The rapid convergence to a stable performance across all environments ensures the accuracy of the worst time-constrained perturbations estimated during training. G Uncertainty set in Mu Jo Co environments The experiments of Section 5 follow the evaluation protocol proposed by [18] and based on Mu Jo Co environments [5]. These environments are designed with a 3D uncertainty sets. Table 7 lists all environments evaluated and their uncertainty sets. The uncertainty sets column defines the ranges of (a) Ant: Episodic reward of the trained agent during adversary training (b) Half Cheetah: Episodic reward of the trained agent during adversary training (c) Hopper: Episodic reward of the trained agent during adversary training (d) Humanoid Standup: Episodic reward of the trained agent during adversary training (e) Walker: Episodic reward of the trained agent during adversary training (f) Legend for algorithm Figure 4: Episodic reward of the trained agent during the training of the adversary across different environments. Each plot represents the performance over 5 million timesteps, with rewards averaged across 10 seeds. The perturbation radius is set to L = 0.001 for all adversaries. variation for the parameters within each environment. The reference parameters column indicates the nominal or default values. 
G Uncertainty set in MuJoCo environments

The experiments of Section 5 follow the evaluation protocol proposed by [18], based on MuJoCo environments [5]. These environments are designed with 3D uncertainty sets. Table 7 lists all evaluated environments and their uncertainty sets. The uncertainty set column defines the ranges of variation of the parameters within each environment, the reference values column indicates the nominal or default values, and the uncertainty parameters column describes the physical meaning of each parameter.

Table 7: List of environments and parameters for the experiments
Environment | Uncertainty set P | Reference values | Uncertainty parameters
Ant | [0.1, 3.0] × [0.01, 3.0] × [0.01, 3.0] | (0.33, 0.04, 0.06) | torso mass; front left leg mass; front right leg mass
Half Cheetah | [0.1, 4.0] × [0.1, 7.0] × [0.1, 3.0] | (0.4, 6.36, 1.53) | world friction; torso mass; back thigh mass
Hopper | [0.1, 3.0] × [0.1, 3.0] × [0.1, 4.0] | (1.00, 3.53, 3.93) | world friction; torso mass; thigh mass
Humanoid Standup | [0.1, 16.0] × [0.1, 5.0] × [0.1, 8.0] | (8.32, 1.77, 4.53) | torso mass; right foot mass; left thigh mass
Walker | [0.1, 4.0] × [0.1, 5.0] × [0.1, 6.0] | (0.7, 3.53, 3.93) | world friction; torso mass; thigh mass
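For concreteness, the sketch below shows one way a 3D parameter vector ψ of this kind could be written into a Gymnasium MuJoCo model, using the Walker-style (world friction, torso mass, thigh mass) parameterization of Table 7. The environment id and the body/geom indices are illustrative assumptions; they are not taken from the paper's code, which follows the protocol of [18].

```python
import numpy as np
import gymnasium as gym

def set_uncertainty_parameters(env, psi):
    """Write psi = (friction, torso_mass, thigh_mass) into the MuJoCo model.
    Index choices below are assumptions made for illustration."""
    model = env.unwrapped.model
    model.geom_friction[0, 0] = psi[0]   # assumed: floor geom is index 0 (sliding friction)
    model.body_mass[1] = psi[1]          # assumed: torso body index
    model.body_mass[2] = psi[2]          # assumed: thigh body index
    return env

env = gym.make("Walker2d-v4")
psi = np.array([0.7, 3.53, 3.93])        # reference values from Table 7
env = set_uncertainty_parameters(env, psi)
obs, info = env.reset(seed=0)
```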
H Raw results

Table 8 reports the non-normalized time-constrained (with a radius of L = 0.001) worst-case scores, averaged across 10 independent runs for each benchmark. Table 9 reports the static worst-case score obtained by each agent across a grid of environments, also averaged across 10 independent runs for each benchmark. Table 10 reports the static average-case score obtained by each agent across a grid of environments, also averaged across 10 independent runs for each benchmark.

Table 8: Avg. of time-constrained worst-case performance over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
Oracle M2TD3 | 5768 ± 395 | 3521 ± 187 | 1241 ± 125 | 116232 ± 1454 | 4559 ± 757
Oracle RARL | 4387 ± 667 | 50 ± 99 | 344 ± 113 | 68979 ± 10641 | 1811 ± 342
Oracle TC-M2TD3 | 7268 ± 704 | 7507 ± 284 | 3386 ± 323 | 114411 ± 16973 | 5344 ± 536
Oracle TC-RARL | 7534 ± 781 | 7526 ± 311 | 3169 ± 311 | 101182 ± 12083 | 4783 ± 382
Stacked TC-M2TD3 | 6502 ± 450 | 6377 ± 517 | 3047 ± 394 | 85524 ± 11448 | 5724 ± 828
Stacked TC-RARL | 6955 ± 690 | 5319 ± 223 | 1747 ± 153 | 107913 ± 5514 | 4152 ± 483
TC-M2TD3 | 7181 ± 591 | 6516 ± 232 | 2511 ± 45 | 129183 ± 9120 | 4964 ± 531
TC-RARL | 7473 ± 361 | 4989 ± 284 | 1475 ± 158 | 108669 ± 17764 | 3971 ± 351
DR | 7247 ± 925 | 4986 ± 363 | 1642 ± 104 | 109618 ± 11479 | 4380 ± 488
M2TD3 | 5622 ± 435 | 3671 ± 405 | 1120 ± 220 | 102839 ± 12987 | 4078 ± 644
RARL | 4348 ± 574 | 382 ± 366 | 240 ± 104 | 106768 ± 4051 | 2388 ± 559
TD3 | 2259 ± 424 | 1808 ± 503 | 777 ± 407 | 104877 ± 12063 | 1893 ± 361

Table 9: Avg. of raw static worst-case performance over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
DR | 19.78 ± 394.84 | 2211.48 ± 915.64 | 245.01 ± 167.21 | 64886.87 ± 30048.79 | 1318.36 ± 777.51
M2TD3 | 2322.73 ± 649.3 | 2031.9 ± 409.7 | 273.6 ± 131.9 | 71900.97 ± 24317.35 | 2214.16 ± 1330.4
Oracle M2TD3 | 2370.93 ± 473.56 | 319.67 ± 599.26 | 267.41 ± 111.47 | 93123.84 ± 26696.17 | 736.59 ± 944.76
Oracle RARL | 1396.88 ± 777.46 | 278.84 ± 54.36 | 167.5 ± 38.2 | 45635.24 ± 15974.44 | 459.74 ± 437.02
Oracle TC-M2TD3 | 120.74 ± 618.23 | 4273.31 ± 246.91 | 168.7 ± 217.94 | 58687.26 ± 22321.77 | 710.99 ± 799.08
Oracle TC-RARL | 1328.27 ± 890.49 | 3458.52 ± 893.22 | 150.54 ± 33.12 | 73276.78 ± 9110.33 | 1299.88 ± 812.63
RARL | 960.11 ± 744.01 | 211.8 ± 218.73 | 170.46 ± 45.73 | 67821.86 ± 21555.24 | 360.31 ± 186.06
Stacked TC-M2TD3 | 242.98 ± 212.98 | 3467.34 ± 418.64 | 289.37 ± 182.18 | 58515.04 ± 19186.25 | 2475.58 ± 1057.03
Stacked TC-RARL | 37.77 ± 320.71 | 1414.37 ± 876.91 | 344.37 ± 190.1 | 77357.17 ± 18186.34 | 1518.86 ± 668.13
TD3 | 123.64 ± 824.35 | 546.21 ± 158.81 | 69.3 ± 42.77 | 64577.24 ± 16606.51 | 114.41 ± 211.05
TC-M2TD3 | 271.34 ± 191.15 | 3286.67 ± 603.14 | 333.36 ± 60.04 | 73428.2 ± 17879.28 | 2603.59 ± 706.63
TC-RARL | 209.04 ± 575.89 | 1738.59 ± 782.71 | 376.01 ± 155.4 | 74840.68 ± 33496.45 | 1513.65 ± 1239.3

Table 10: Avg. of raw static average-case performance over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
DR | 7500.88 ± 143.38 | 6170.33 ± 442.57 | 1688.36 ± 225.59 | 110939.89 ± 22396.41 | 4611.24 ± 463.42
M2TD3 | 5577.41 ± 316.95 | 4000.98 ± 314.76 | 1193.32 ± 254.9 | 109598.43 ± 12992.35 | 4311.2 ± 877.89
Oracle M2TD3 | 5958.21 ± 237.32 | 4930.18 ± 390.96 | 1249.62 ± 212.74 | 118273.54 ± 13891.06 | 4616.05 ± 407.94
Oracle RARL | 4684.83 ± 648.14 | 36.19 ± 216.52 | 380.39 ± 110.14 | 76920.58 ± 26135.3 | 1451.39 ± 1132.87
Oracle TC-M2TD3 | 7739.65 ± 254.65 | 9536.92 ± 429.14 | 3281.92 ± 61.79 | 119737.21 ± 12697.2 | 5442.85 ± 499.78
Oracle TC-RARL | 7889.1 ± 56.0 | 9474.0 ± 341.69 | 3071.17 ± 220.39 | 104348.01 ± 26249.98 | 5220.2 ± 318.07
RARL | 4650.55 ± 395.03 | 206.71 ± 887.25 | 276.37 ± 52.42 | 104764.87 ± 17400.85 | 2493.26 ± 1113.74
Stacked TC-M2TD3 | 6912.76 ± 1116.81 | 8583.55 ± 479.97 | 3124.06 ± 133.27 | 88039.74 ± 15138.11 | 5809.54 ± 703.92
Stacked TC-RARL | 7123.07 ± 332.33 | 6130.71 ± 384.05 | 2072.75 ± 306.48 | 110843.2 ± 19887.32 | 4596.79 ± 619.2
TD3 | 2600.43 ± 1468.87 | 2350.58 ± 357.12 | 733.18 ± 382.06 | 100533.0 ± 12298.37 | 2965.47 ± 685.39
TC-M2TD3 | 7366.9 ± 169.58 | 8467.64 ± 397.42 | 2756.5 ± 273.91 | 130305.38 ± 22865.1 | 5070.71 ± 315.7
TC-RARL | 7558.58 ± 198.37 | 6092.61 ± 365.68 | 1558.26 ± 242.17 | 108635.71 ± 19848.21 | 4325.42 ± 283.04

H.1 Fixed adversary evaluation

At the beginning of each episode, ψ_0 ~ U(Ψ) is selected for every fixed adversary. The episode length is 1000 steps. To begin with, the random fixed adversary simulates stochastic changes: it selects a parameter ψ_t at each timestep within a radius of L = 0.1, which is 100 times larger than in our training methods. This tests the agents' adaptability to unexpected changes. In contrast, the cosine fixed adversary introduces deterministic changes using a cosine function; the radius of L = 0.1 scales the frequency of the cosine function, ensuring smooth and periodic variations, and a phase shift at the start of each episode ensures different starting points. Meanwhile, the linear fixed adversary employs a linear function: the parameters change linearly from the initial value to a vertex of the uncertainty set Ψ over 1000 steps. Furthermore, the exponential fixed adversary uses an exponential function: the parameters change exponentially from the initial value to a vertex of the uncertainty set Ψ over 1000 steps, ensuring smooth and predictable variations. Similarly, the logarithmic fixed adversary uses a logarithmic function: the parameters change logarithmically from the initial value to a vertex of the uncertainty set Ψ over 1000 steps, again ensuring smooth and predictable variations.

Agents trained under the time-constrained framework outperform all baselines in all environments for each fixed adversary, except when compared to the Oracle TC method, which has access to ψ. In this case, the Stacked-TC or TC methods outperform all baselines in all environments for the cosine, logarithmic, and exponential adversaries, and outperform the baselines in 4 out of 5 instances for the random and linear fixed adversaries.
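To make the five fixed schedules concrete, the sketch below shows one possible implementation of each adversary as a mapping from the timestep t to a parameter vector ψ_t in the normalized uncertainty set, starting from ψ_0 and, for the deterministic schedules, moving toward a vertex of Ψ. The precise frequency scaling, vertex selection, and phase-shift conventions are assumptions for illustration; they are not taken from the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)
EP_LEN = 1000  # episode length

def random_adversary(psi_prev, radius=0.1):
    """Stochastic schedule: move within a radius around the previous psi at each step."""
    step = rng.uniform(-radius, radius, size=psi_prev.shape)
    return np.clip(psi_prev + step, 0.0, 1.0)

def cosine_adversary(t, psi0, radius=0.1, phase=0.0):
    """Deterministic periodic schedule; the radius scales the frequency (assumed convention)."""
    return np.clip(psi0 + 0.5 * np.cos(radius * t + phase), 0.0, 1.0)

def linear_adversary(t, psi0, vertex):
    """Linear interpolation from psi0 to a vertex of the normalized uncertainty set."""
    return psi0 + (vertex - psi0) * (t / EP_LEN)

def exponential_adversary(t, psi0, vertex, rate=5.0):
    """Exponential approach from psi0 to the vertex (fast near the end of the episode)."""
    w = (np.exp(rate * t / EP_LEN) - 1.0) / (np.exp(rate) - 1.0)
    return psi0 + (vertex - psi0) * w

def logarithmic_adversary(t, psi0, vertex):
    """Logarithmic approach from psi0 to the vertex (fast near the start of the episode)."""
    w = np.log1p(t) / np.log1p(EP_LEN)
    return psi0 + (vertex - psi0) * w

# Example rollout of the linear schedule over one episode.
psi0 = rng.uniform(size=3)                          # psi_0 ~ U(Psi), normalized to [0, 1]^3
vertex = rng.integers(0, 2, size=3).astype(float)   # a randomly chosen vertex of the box
schedule = [linear_adversary(t, psi0, vertex) for t in range(EP_LEN + 1)]
```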
Table 11: Avg. performance against a time-constrained fixed random adversary with a radius L = 0.1 over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
Oracle TC-M2TD3 | 7782 ± 915 | 8805 ± 165 | 2365 ± 199 | 116791 ± 12572 | 5148 ± 558
Oracle TC-RARL | 8041 ± 470 | 8727 ± 227 | 2120 ± 96 | 107733 ± 11975 | 4896 ± 326
Oracle M2TD3 | 5830 ± 542 | 4445 ± 186 | 1222 ± 111 | 118861 ± 1365 | 4584 ± 787
Oracle RARL | 4628 ± 514 | 51 ± 60 | 370 ± 141 | 81583 ± 16526 | 1829 ± 356
Stacked TC-M2TD3 | 6888 ± 738 | 7400 ± 385 | 2114 ± 138 | 88436 ± 10750 | 5278 ± 845
Stacked TC-RARL | 7045 ± 904 | 5992 ± 427 | 1940 ± 93 | 106213 ± 6770 | 4430 ± 389
TC-M2TD3 | 7156 ± 692 | 7530 ± 185 | 2157 ± 112 | 129599 ± 13556 | 4931 ± 568
TC-RARL | 7554 ± 948 | 5751 ± 482 | 1445 ± 203 | 105144 ± 16813 | 4112 ± 329
DR | 7572 ± 629 | 6048 ± 349 | 1416 ± 168 | 105677 ± 16333 | 4371 ± 431
M2TD3 | 5588 ± 516 | 4180 ± 70 | 1018 ± 271 | 107692 ± 10414 | 4176 ± 783
RARL | 4347 ± 567 | 240 ± 250 | 390 ± 130 | 103583 ± 9217 | 1925 ± 501
TD3 | 4017 ± 518 | 2028 ± 1250 | 1944 ± 246 | 91205 ± 11350 | 2860 ± 419

H.2 Agents training curves

We trained each agent for 5 million steps, closely monitoring the cumulative reward obtained over trajectories of 1,000 steps. To enhance the reliability of our results, we averaged the performance curves across 10 different seeds. The graphs in Figures 5 to 15 illustrate how the different training methods, including Domain Randomization, M2TD3, RARL, Oracle RARL, Oracle M2TD3, TC-RARL, TC-M2TD3, Stacked TC-RARL and Stacked TC-M2TD3, impact agent performance across the various environments.

Table 12: Avg. performance against a time-constrained fixed cosine adversary with a radius L = 0.1 over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
Oracle M2TD3 | 5528 ± 637 | 3453 ± 266 | 1016 ± 48 | 119813 ± 3281 | 3589 ± 863
Oracle RARL | 4550 ± 626 | 79 ± 34 | 371 ± 140 | 74116 ± 7890 | 1593 ± 326
Oracle TC-M2TD3 | 7586 ± 1345 | 8174 ± 383 | 1946 ± 104 | 115506 ± 12470 | 4464 ± 781
Oracle TC-RARL | 7522 ± 1435 | 7838 ± 810 | 1735 ± 138 | 110535 ± 12702 | 4442 ± 591
Stacked TC-M2TD3 | 6269 ± 849 | 7173 ± 509 | 1734 ± 157 | 88157 ± 10654 | 4888 ± 567
Stacked TC-RARL | 6510 ± 1395 | 5385 ± 445 | 1519 ± 118 | 105696 ± 5243 | 3848 ± 404
TC-M2TD3 | 6350 ± 769 | 6797 ± 609 | 1413 ± 167 | 130892 ± 11544 | 4611 ± 632
TC-RARL | 7124 ± 912 | 5109 ± 348 | 1172 ± 129 | 102864 ± 13308 | 3548 ± 545
DR | 6975 ± 992 | 5490 ± 384 | 1091 ± 169 | 109227 ± 17068 | 3851 ± 612
M2TD3 | 5330 ± 684 | 3634 ± 321 | 938 ± 158 | 108136 ± 9755 | 4126 ± 644
RARL | 4153 ± 602 | 154 ± 261 | 363 ± 58 | 103366 ± 7604 | 1689 ± 465
TD3 | 4025 ± 557 | 2784 ± 370 | 1317 ± 189 | 94352 ± 10101 | 2020 ± 355

Table 13: Avg. performance against a fixed linear adversary over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
Oracle M2TD3 | 5811 ± 121 | 3560 ± 167 | 1216 ± 326 | 118829 ± 846 | 4431 ± 615
Oracle RARL | 4447 ± 600 | 122 ± 64 | 308 ± 62 | 81498 ± 12860 | 1503 ± 450
Oracle TC-M2TD3 | 7919 ± 595 | 7495 ± 268 | 2983 ± 252 | 117610 ± 11682 | 4952 ± 415
Oracle TC-RARL | 8069 ± 151 | 7443 ± 236 | 2805 ± 352 | 110314 ± 9354 | 4613 ± 257
Stacked TC-M2TD3 | 7003 ± 812 | 6365 ± 335 | 2714 ± 198 | 89556 ± 11115 | 5256 ± 675
Stacked TC-RARL | 7328 ± 251 | 5301 ± 86 | 1616 ± 137 | 105137 ± 7903 | 4234 ± 385
TC-M2TD3 | 7622 ± 413 | 6451 ± 246 | 2228 ± 131 | 129501 ± 10326 | 4844 ± 417
TC-RARL | 7675 ± 143 | 4881 ± 251 | 1277 ± 288 | 105566 ± 15551 | 3906 ± 381
DR | 7713 ± 412 | 5290 ± 103 | 1419 ± 122 | 108711 ± 16696 | 4307 ± 309
M2TD3 | 5444 ± 225 | 3810 ± 69 | 970 ± 323 | 106311 ± 9771 | 4128 ± 727
RARL | 4651 ± 446 | 218 ± 138 | 346 ± 22 | 101477 ± 8947 | 1894 ± 515
TD3 | 3493 ± 475 | 1462 ± 1246 | 1722 ± 366 | 89934 ± 10644 | 2396 ± 416
Table 14: Avg. performance against a fixed logarithmic adversary over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
Oracle M2TD3 | 5561 ± 580 | 3086 ± 163 | 957 ± 165 | 119214 ± 2525 | 4148 ± 630
Oracle RARL | 4911 ± 177 | 145 ± 67 | 293 ± 49 | 79522 ± 13470 | 1618 ± 142
Oracle TC-M2TD3 | 7963 ± 796 | 6625 ± 204 | 2577 ± 171 | 116664 ± 11798 | 4818 ± 451
Oracle TC-RARL | 8061 ± 821 | 6532 ± 304 | 2572 ± 177 | 108213 ± 10684 | 4375 ± 382
Stacked TC-M2TD3 | 7315 ± 478 | 5863 ± 290 | 2283 ± 122 | 87691 ± 11133 | 4931 ± 735
Stacked TC-RARL | 7514 ± 62 | 4770 ± 145 | 1426 ± 197 | 104193 ± 8030 | 3939 ± 369
TC-M2TD3 | 7910 ± 90 | 5657 ± 280 | 1702 ± 226 | 128467 ± 10762 | 4664 ± 412
TC-RARL | 7686 ± 208 | 4475 ± 238 | 1082 ± 298 | 104835 ± 16040 | 3636 ± 428
DR | 7883 ± 67 | 4721 ± 146 | 1166 ± 332 | 106171 ± 16867 | 3995 ± 313
M2TD3 | 5371 ± 279 | 3565 ± 105 | 802 ± 271 | 104002 ± 11606 | 4206 ± 712
RARL | 4620 ± 763 | 231 ± 110 | 340 ± 44 | 102004 ± 9925 | 1919 ± 499
TD3 | 3678 ± 623 | 576 ± 983 | 1389 ± 327 | 88952 ± 11367 | 1956 ± 360

Table 15: Avg. performance against a fixed exponential adversary over 10 seeds for each method
Method | Ant | Half Cheetah | Hopper | Humanoid Standup | Walker
Oracle M2TD3 | 5860 ± 93 | 3780 ± 137 | 1271 ± 224 | 119205 ± 1217 | 4767 ± 815
Oracle RARL | 4585 ± 674 | 88 ± 79 | 302 ± 41 | 82063 ± 13274 | 1611 ± 342
Oracle TC-M2TD3 | 7491 ± 624 | 8256 ± 269 | 2894 ± 244 | 118476 ± 11683 | 5161 ± 289
Oracle TC-RARL | 7724 ± 368 | 8000 ± 250 | 3036 ± 293 | 110092 ± 10754 | 4650 ± 503
Stacked TC-M2TD3 | 6903 ± 365 | 7041 ± 302 | 2721 ± 214 | 91077 ± 11945 | 5310 ± 882
Stacked TC-RARL | 7061 ± 222 | 5741 ± 249 | 1825 ± 145 | 104793 ± 6758 | 4376 ± 342
TC-M2TD3 | 7318 ± 299 | 7139 ± 387 | 2408 ± 113 | 129966 ± 10823 | 4910 ± 663
TC-RARL | 7441 ± 133 | 5326 ± 220 | 1457 ± 163 | 106491 ± 14605 | 4017 ± 439
DR | 7389 ± 206 | 5691 ± 121 | 1564 ± 99 | 106290 ± 17502 | 4224 ± 660
M2TD3 | 5466 ± 318 | 3909 ± 332 | 1062 ± 272 | 107097 ± 9551 | 4274 ± 582
RARL | 4556 ± 729 | 228 ± 181 | 351 ± 24 | 102096 ± 8291 | 2053 ± 493
TD3 | 3771 ± 228 | 2302 ± 343 | 2201 ± 219 | 90496 ± 9487 | 2768 ± 538

Table 16: Average wall-clock time for each algorithm
Algorithm | Wall-clock time
TD3 | 14h
M2TD3 | 16h
RARL | 18h
TC | 16h
Stacked TC | 16h
Oracle TC | 16h

I Computer resources

All experiments were run on a desktop machine (Intel i9, 10th generation processor, 64GB RAM) with a single NVIDIA RTX 4090 GPU. Averages and standard deviations were computed from 10 independent repetitions of each experiment.

J Broader impact

This paper aims to advance robust reinforcement learning. It addresses general mathematical and computational challenges. These challenges may have societal and technological impacts, but we do not find it necessary to highlight them here.

J.1 Limitations

While our proposed Time-Constrained Robust Markov Decision Process (TC-RMDP) framework significantly advances robust reinforcement learning by addressing multifactorial, correlated, and time-dependent disturbances, several limitations must be acknowledged. The TC-RMDP framework assumes that the parameter vector ψ that governs environmental disturbances is known during training. In real-world applications, obtaining such detailed information may not always be feasible. This reliance on precise parameter knowledge limits the practical deployment of our algorithms in environments where ψ cannot be accurately measured or inferred. Our approach also assumes that the environment's dynamics can be accurately parameterized and that these parameters remain within a predefined uncertainty set Ψ. This assumption might not hold in more complex or highly dynamic environments where disturbances are not easily parameterized or where the uncertainty set Ψ cannot comprehensively capture all possible variations. Consequently, the robustness of the learned policies might degrade when facing disturbances outside the considered parameter space. We leave addressing these limitations to future work.
Figure 5: Averaged training curves for the Domain Randomization method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).

K A heuristic for choosing the radius L

We used a simple heuristic to select the radius L in our experiments. First, we normalized the parameters of the uncertainty set, scaling them to lie within the range [0, 1] in each dimension. We then divided this range by the number of steps in an episode, denoted ep, yielding L = 1/ep. Our experiments, with ep = 1000, therefore used a radius of L = 0.001. This choice ensures that, over the course of an episode, any initial parameter ψ_0 can reach any point of the uncertainty set.

Figure 6: Averaged training curves for the M2TD3 method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We propose three distinct time-constrained robust MDP algorithms, each using varying levels of environmental information, and evaluate them extensively on continuous control benchmarks. Our results demonstrate that these algorithms yield an efficient tradeoff between performance and robustness, outperforming traditional deep robust RL methods in time-constrained environments while preserving robustness in classical benchmarks.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

Figure 7: Averaged training curves for the RARL method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).
Figure 8: Averaged training curves for the TD3 method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We have a section in the appendix discussing the limitations of our work.
Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper.
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

Figure 9: Averaged training curves for the Oracle RARL method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).

The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

Figure 10: Averaged training curves for the Oracle M2TD3 method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The hypotheses of the proofs are clearly stated in Theorems 6.2 and 2.1.
Guidelines: The answer NA means that the paper does not include theoretical results.

Figure 11: Averaged training curves for the Oracle M2TD3 method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).

All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility (a) Training curve on Ant with Oracle TC-RARL (b) Training curve on Half Cheetah with Oracle TCRARL (c) Training curve on Hopper with Oracle TCRARL (d) Training curve on Humanoid Standup with Oracle TC-RARL (e) Training curve on Walker with Oracle TCRARL Figure 12: Averaged training curves for the Oracle TC-RARL method over 10 seeds Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide the algorithm, all detail implementations, hyperparameters. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. (a) Training curve on Ant with Oracle TC-M2TD3 (b) Training curve on Half Cheetah with Oracle TCM2TD3 (c) Training curve on Hopper with Oracle TCM2TD3 (d) Training curve on Humanoid Standup with Oracle TC-M2TD3 (e) Training curve on Walker with Oracle TCM2TD3 Figure 13: Averaged training curves for the Oracle TC-M2TD3 method over 10 seeds If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. (a) Training curve on Ant with Stacked TCM2TD3 (b) Training curve on Half Cheetah with Stacked TC-M2TD3 (c) Training curve on Hopper with Stacked TCM2TD3 (d) Training curve on Humanoid Standup with Stacked TC-M2TD3 (e) Training curve on Walker with Stacked TCM2TD3 Figure 14: Averaged training curves for the Stacked TC-M2TD3 method over 10 seeds While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). 
Figure 15: Averaged training curves for the Stacked TC-RARL method over 10 seeds (panels (a)-(e): Ant, Half Cheetah, Hopper, Humanoid Standup, Walker).

(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide the code with clear instructions to reproduce the experiments.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide all hyperparameters, the optimizer, and the reinforcement learning algorithms; all these details are given in the appendix.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We provide our computer resources to run all experiments.
Guidelines: The answer NA means that the paper does not include experiments.
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We have a section in the appendix discussing the computer resources used for our work.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Our contribution aims at safer RL algorithms, and we conducted our experiments on open-source benchmarks.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We discuss the broader impact in the conclusion and the appendix.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: The authors cite the original paper that produced the code package or dataset. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. 
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.