# Tempo Adaptation in Non-stationary Reinforcement Learning

Hyunin Lee¹, Yuhao Ding¹, Jongmin Lee¹, Ming Jin², Javad Lavaei¹, Somayeh Sojoudi¹
¹UC Berkeley, Berkeley, CA 94709  ²Virginia Tech, Blacksburg, VA 24061
{hyunin,yuhao_ding,jongmin.lee,lavaei,sojoudi}@berkeley.edu, jinming@vt.edu

Abstract

We first raise and tackle a time synchronization issue between the agent and the environment in non-stationary reinforcement learning (RL), a crucial factor hindering its real-world applications. In reality, environmental changes occur over wall-clock time (t) rather than episode progress (k), where wall-clock time signifies the actual elapsed time within the fixed duration t ∈ [0, T]. In existing works, at episode k, the agent rolls out a trajectory and trains a policy before transitioning to episode k+1. In the context of the time-desynchronized environment, however, the agent at time t_k allocates Δt for trajectory generation and training, and subsequently moves to the next episode at t_{k+1} = t_k + Δt. Despite a fixed total number of episodes (K), the agent accumulates different trajectories influenced by the choice of interaction times (t_1, t_2, ..., t_K), significantly impacting the suboptimality gap of the policy. We propose a Proactively Synchronizing Tempo (ProST) framework that computes a suboptimal sequence {t_1, t_2, ..., t_K} (= {t}_1^K) by minimizing an upper bound on its performance measure, i.e., the dynamic regret. Our main contribution is to show that a suboptimal {t}_1^K trades off between the policy training time (agent tempo) and how fast the environment changes (environment tempo). Theoretically, this work develops a suboptimal {t}_1^K as a function of the degree of the environment's non-stationarity while also achieving a sublinear dynamic regret. Our experimental evaluation on various high-dimensional non-stationary environments shows that the ProST framework achieves a higher online return at the suboptimal {t}_1^K than the existing methods.

1 Introduction

The prevailing reinforcement learning (RL) paradigm gathers past data, trains models in the present, and deploys them in the future. This approach has proven successful for stationary Markov decision processes (MDPs), where the reward and transition functions remain constant [1-3]. However, challenges arise when the environments undergo significant changes, particularly when the reward and transition functions depend on time or latent factors [4-6], as in non-stationary MDPs. Managing non-stationarity in environments is crucial for real-world RL applications. Thus, adapting to changing environments is pivotal in non-stationary RL.

This paper addresses a practical concern that has inadvertently been overlooked within traditional non-stationary RL environments, namely, the time synchronization between the agent and the environment. We raise the impracticality of utilizing episode-varying environments in existing non-stationary RL research, as such environments do not align with the real-world scenario where changes occur regardless of the agent's behavior.

* Corresponding authors. This work was supported by grants from ARO, ONR, AFOSR, NSF, and the UC Noyce Initiative.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: (a) 2D goal reacher in a time-desynchronized environment for one policy update, where the agent learns an inaccurate policy on an accurate model; (b) for three policy updates, the agent learns a near-optimal policy on an inaccurate model; (c) rewards per episode in the 2D goal reacher with four model-free baselines, where ProST-T is one of our proposed methods.
In an episode-varying environment, the agent has complete control over determining the time to execute episode k, the duration of policy training between episodes k and k+1, and the transition time to episode k+1. The issue stems from the premise that the environment undergoes dynamic changes throughout the course of each episode, with the rate of non-stationarity contingent upon the behavior exhibited by the agent. However, an independent wall-clock time (t) exists in a real-world environment, so the above three events are now recognized as wall-clock time t_k, training time Δt, and t_{k+1}. The selection of interaction times (t_k, t_{k+1}) has a notable impact on the collected trajectories, while the interval t_{k+1} − t_k establishes an upper limit on the duration of training (Δt). This interval profoundly influences the suboptimality gap of the policy. In the context of a time-desynchronized environment, achieving an optimal policy requires addressing a previously unexplored question: the determination of the optimal time sequence {t_1, t_2, ..., t_K} (= {t}_1^K) at which the agent should interact with the environment.

We elucidate the significance of the aforementioned research question through an example. Consider a robot with the goal of reaching inside a gray-shaded non-fixed target box, known as the goal reacher (Appendix A.1). Note that the reward changes as the position of the box changes over time (Figure 1(a)). We begin by considering a scenario in which the wall-clock time and the episode are synchronized, wherein the environment evolves alongside the episode. During each episode k, the agent rolls out a trajectory and iteratively updates the policy N times, with the assumption that one policy update requires one second, and then the agent transitions to the subsequent episode k+1. In conventional non-stationary RL environments, it is evident that a larger value of N provides an advantage in terms of faster adaptation toward a near-optimal policy. However, regardless of the chosen value of N, the agent will consistently encounter the same environment in the subsequent episode. Now, consider a scenario in which the wall-clock time and the episode are desynchronized. In this context, given a fixed wall-clock time duration t ∈ [0, 10], the agent is faced with the additional task of determining both the total number of interactions (denoted as the total number of episodes K) and the specific time instances for these interactions {t}_1^K, where t_k ∈ [0, 10] and t_{k−1} < t_k for k ∈ [K]. Figure 1(a) shows an agent that interacts with the environment ten times, that is, {t}_1^K = {1, 2, ..., 10}, and spends each time interval (t_k, t_{k+1}), which lasts one second, to train the policy (K = 10, N = 1). The high frequency of interaction (K = 10) provides adequate data for precise learning of the future box position (t = 11), yet a single policy update (N = 1) may not approximate the optimal policy. Now, if the agent interacts with the environment four times, i.e., {t}_1^K = {1, 4, 7, 10} (see Figure 1(b)), it becomes feasible to train the policy over a duration of three seconds (K = 4, N = 3). A longer period of policy training (N = 3) helps the agent obtain a near-optimal policy. However, limited observation data (K = 4) and large time intervals (t ∈ {11, 12, 13}) may lead to less accurate box predictions.
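For concreteness, the following minimal Python sketch enumerates the two interaction schedules from the example above under its stated assumptions (a ten-second wall-clock budget and one policy update per second); the schedule format and variable names are illustrative only, not part of the ProST implementation.

```python
# Illustrative only: enumerate (K, N) schedules for a 10-second budget,
# assuming one policy update costs one second (as in the example above).
T = 10  # total wall-clock duration

for delta in (1, 3):  # training time between interactions, in seconds
    K = (T - 1) // delta + 1                      # number of environment interactions
    N = delta                                      # policy updates per interaction
    schedule = [1 + delta * k for k in range(K)]   # {t}_1^K = {t_1, ..., t_K}
    print(f"delta={delta}s -> K={K} interactions, N={N} updates each, schedule={schedule}")

# delta=1 reproduces {1,2,...,10} (K=10, N=1): more data, fewer updates.
# delta=3 reproduces {1,4,7,10}  (K=4,  N=3): fewer data, more updates.
```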
This example underscores the practical importance of aligning the interaction time of the agent with the environment in non-stationary RL. Determining the optimal sequence {t}_1^K involves a trade-off between achieving an optimal model and an optimal policy. Based on the previous example, our key insight is that, in non-stationary environments, a new factor, tempo, emerges. Informally, tempo refers to the pace of processes occurring in a non-stationary environment. We define the environment tempo as how fast the environment changes, and the agent tempo as how frequently the agent updates its policy. Despite the importance of considering the tempo to find the optimal {t}_1^K, the existing formulations and methods for non-stationary RL are insufficient; none of the existing works has adequately addressed this crucial aspect.

Our framework, ProST, provides a solution for finding the optimal {t}_1^K by computing a minimum solution to an upper bound on its performance measure. It proactively optimizes the time sequence by leveraging the agent tempo and the environment tempo. The ProST framework is divided into two components, a future policy optimizer (OPT_π) and a time optimizer (OPT_t), and is characterized by three key features: 1) it is proactive in nature, as it forecasts the future MDP model; 2) it is model-based, as it optimizes the policy in the created MDP; and 3) it is a synchronizing-tempo framework, as it finds a suboptimal training time by adjusting how many times the agent needs to update the policy (agent tempo) relative to how fast the environment changes (environment tempo). Our framework is general in the sense that it can be equipped with any common policy update algorithm. Compared to the existing works [7-9], our approach achieves higher rewards and more stable performance over time (see Figure 1(c) and Section 5).

We analyze the statistical and computational properties of ProST in a tabular MDP, where the resulting instance is named ProST-T. Our framework learns in a novel MDP, namely the time-elapsing MDP, and quantifies its non-stationarity with a novel metric, namely the time-elapsing variation budget, both of which take the elapsed wall-clock time into account. We decompose the dynamic regret of ProST-T (Theorem 1) into two components: the dynamic regret of OPT_π, which learns a future MDP model (Proposition 1), and the dynamic regret of OPT_t, which computes a near-optimal policy in that model (Theorem 2, Proposition 2). We show that both regrets satisfy a sublinear rate with respect to the total number of episodes regardless of the agent tempo. More importantly, we obtain a suboptimal training time by minimizing an objective that strikes a balance between the upper bounds of those two dynamic regrets, which reflect the tempos of the agent and the environment (Theorem 3). We find an interesting property that the future MDP model error of OPT_π serves as a common factor in both regrets, and we show that the upper bound on the dynamic regret of ProST-T can be improved by a joint optimization problem of learning both different weights on the observed data and a model (Theorem 4, Remark 1). Finally, we introduce ProST-G, an adaptable learning algorithm for high-dimensional tasks obtained through a practical approximation of ProST. Empirically, ProST-G provides solid evidence of different reward returns depending on the policy training time and of the significance of learning the future MDP model.
ProST-G also consistently finds a near-optimal policy, outperforming four popular RL baselines used in non-stationary environments on three different Mujoco tasks.

Notation. The sets of natural, real, and non-negative real numbers are denoted by ℕ, ℝ, and ℝ₊, respectively. For a finite set Z, the notation |Z| denotes its cardinality and Δ(Z) denotes the probability simplex over Z. For X ∈ ℕ, we define [X] = {1, 2, ..., X}. For a variable X, we denote by X̂ a forecasted (or predicted) value at the current time and by X̄ an observed value from the past. Also, for any time variable t > 0 and k ∈ ℕ, we denote the time sequence {t_1, t_2, ..., t_k} by {t}_1^k, and the value of a variable X at time t_k by X_{t_k}. We use the shorthand notation X^{(k)} (or X̂^{(k)}) for X_{t_k} (or X̂_{t_k}). We use the notation {x}_a^b to denote the sequence of variables {x_a, x_{a+1}, ..., x_b}, and {x}^{(a∼b)} to represent the sequence {x_{t_a}, x_{t_{a+1}}, ..., x_{t_b}}. Given two variables x and y, let x ∨ y denote max(x, y) and x ∧ y denote min(x, y). Given two complex numbers z_1 and z_2, we write z_2 = W(z_1) if z_2 e^{z_2} = z_1, where W is the Lambert function. Given a variable x, the notation a = O(b(x)) means that a ≤ C·b(x) for some constant C > 0 that is independent of x, and the notation a = Ω(b(x)) means that a ≥ C·b(x) for some constant C > 0 that is independent of x. Further details are given in Appendix C.1.

2 Problem statement: Desynchronizing timelines

2.1 Time-elapsing Markov Decision Process

In this paper, we study a non-stationary Markov decision process (MDP) for which the transition probability and the reward change over time. We begin by clarifying that the term episode is agent-centric, not environment-centric. Prior solutions for episode-varying (or step-varying) MDPs operate under the assumption that the timing of MDP changes aligns with the agent commencing a new episode (or step). We introduce the new concept of a time-elapsing MDP. It starts from wall-clock time t = 0 and runs to t = T, where T is fixed. The time-elapsing MDP at time t ∈ [0, T] is defined as M_t = ⟨S, A, H, P_t, R_t, γ⟩, where S is the state space, A is the action space, H is the number of steps, P_t : S × A → Δ(S) is the transition probability, R_t : S × A → ℝ is the reward function, and γ is a discounting factor. Prior to executing the first episode, the agent determines the total number of interactions with the environment (denoted as the total number of episodes K) and subsequently computes the sequence of interaction times {t}_1^K through an optimization problem. We denote by t_k the wall-clock time of the environment when the agent starts episode k. Similar to the existing non-stationary RL framework, the agent's objective is to learn a policy π_{t_k} : S → Δ(A) for all k. This is achieved through a total of K episode interactions, namely with {M_{t_1}, M_{t_2}, ..., M_{t_K}}, where the agent dedicates the time interval (t_k, t_{k+1}) to policy training and then obtains a sequence of suboptimal policies {π_{t_1}, π_{t_2}, ..., π_{t_K}} that maximizes a non-stationary policy evaluation metric, the dynamic regret. Dealing with a time-elapsing MDP instead of a conventional MDP raises an important question that should be addressed: within the time duration [0, T], what time sequence {t}_1^K yields favorable trajectory samples for obtaining an optimal policy? This question is also related to the following: what is the optimal value of K, i.e., the total number of episodes, that strikes a satisfactory balance between the amount of observed trajectories and the accuracy of policy training?
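Before turning to these questions, the following minimal sketch makes the time-elapsing MDP of Section 2.1 concrete by wrapping time-indexed transition and reward functions behind a wall-clock interface; the class name, the toy dynamics, and the interaction schedule are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class TimeElapsingMDP:
    """M_t = <S, A, H, P_t, R_t, gamma>, indexed by wall-clock time t in [0, T]."""
    n_states: int
    n_actions: int
    horizon: int
    gamma: float
    P: Callable[[float, int, int], np.ndarray]  # P_t(.|s, a): distribution over next states
    R: Callable[[float, int, int], float]       # R_t(s, a)

    def rollout(self, t: float, policy, rng: np.random.Generator):
        """Execute one H-step episode in the snapshot M_t taken at wall-clock time t."""
        s, traj = 0, []
        for _ in range(self.horizon):
            a = policy(s)
            r = self.R(t, s, a)
            s_next = rng.choice(self.n_states, p=self.P(t, s, a))
            traj.append((s, a, r, s_next))
            s = s_next
        return traj

# Toy instance: the reward drifts with wall-clock time, the transitions stay fixed.
rng = np.random.default_rng(0)
mdp = TimeElapsingMDP(
    n_states=2, n_actions=2, horizon=5, gamma=0.99,
    P=lambda t, s, a: np.array([0.5, 0.5]),
    R=lambda t, s, a: np.sin(t) if a == 0 else -np.sin(t),
)
# The agent chooses when to interact: each t_k freezes a snapshot M_{t_k}.
for t_k in (1.0, 4.0, 7.0, 10.0):            # e.g., {t}_1^K = {1, 4, 7, 10}
    print(t_k, mdp.rollout(t_k, policy=lambda s: 0, rng=rng)[0])
```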
The two intertwined questions above concern an important aspect of RL, namely the computation of the optimal policy for a given t_k. In Section 4, we propose the ProST framework, which computes a suboptimal K* and its corresponding suboptimal time sequence {t*}_1^{K*} based on information about the environment's rate of change. In Section 3, we compute a near-optimal policy for {t*}_1^{K*}. Before proceeding with the above results, we introduce a new metric quantifying the environment's pace of change, referred to as the time-elapsing variation budget.

2.2 Time-elapsing variation budget

The variation budget [10-12] is a metric that quantifies the speed with which the environment changes. Driven by our motivations, we introduce a new metric imbued with real-time considerations, named the time-elapsing variation budget B(Δt). Unlike the existing variation budget, which quantifies the environment's non-stationarity across episodes {1, 2, ..., K}, our definition assesses it across {t_1, t_2, ..., t_K}, where the interval Δt = t_{k+1} − t_k remains constant regardless of k ∈ [K−1]. For further analysis, we define two time-elapsing variation budgets, one for the transition probability and another for the reward function.

Definition 1 (Time-elapsing variation budgets). For a given sequence {t_1, t_2, ..., t_K}, assume that the interval Δt is equal to the policy training time Δπ. We define two time-elapsing variation budgets B_p(Δπ) and B_r(Δπ) as

B_p(Δπ) = Σ_{k=1}^{K−1} sup_{s,a} ‖P_{t_{k+1}}(·|s,a) − P_{t_k}(·|s,a)‖_1,   B_r(Δπ) = Σ_{k=1}^{K−1} sup_{s,a} |R_{t_{k+1}}(s,a) − R_{t_k}(s,a)|.

To enhance the representation of a real-world system using the time-elapsing variation budgets, we make the following assumption.

Assumption 1 (Drifting constants). There exist constants c > 1 and α_r, α_p ≥ 0 such that B_p(cΔπ) ≤ c^{α_p} B_p(Δπ) and B_r(cΔπ) ≤ c^{α_r} B_r(Δπ). We call α_p and α_r the drifting constants.

2.3 Suboptimal training time

Beyond the formal MDP framework, the agent can face different time-elapsing variation budgets depending on the training time Δπ ∈ (0, T), even within the same time-elapsing MDP. Intuitively, a short time Δπ is inadequate for obtaining a near-optimal policy, yet it facilitates frequent interactions with the environment, leading to a reduction in the empirical model error due to a larger volume of data. On the contrary, a long time Δπ may ensure obtaining a near-optimal policy, but it also introduces greater uncertainty in learning the environment. This motivates us to find a suboptimal training time Δπ* ∈ (0, T) that strikes a balance between the suboptimality gap of the policy and the empirical model error. If it exists, then Δπ* provides a suboptimal K* = ⌊T/Δπ*⌋ and a suboptimal time sequence with t*_k = t_1 + Δπ*(k−1) for all k ∈ [K*]. Our ProST framework computes the parameter Δπ*, then sets {t*}_1^{K*}, and finally finds a future near-optimal policy for time t*_{k+1} at time t*_k. In the next section, we first study how to approximate the one-episode-ahead suboptimal policy π^{*,t_{k+1}} at time t_k when {t}_1^K is given.

3 Future policy optimizer

Figure 2: The ProST framework.

For given t_k and t_{k+1}, the future policy optimizer (OPT_π), as a module of the ProST framework (Figure 2), computes a near-optimal policy for the future time t_{k+1} at time t_k via two procedures: (i) it first forecasts the future MDP model of time t_{k+1} at time t_k using the MDP forecaster function; (ii) it then uses an arbitrary policy optimization algorithm within the forecasted MDP model to obtain a future near-optimal policy π̂^{*,t_{k+1}}.

3.1 MDP forecaster

Our ProST framework is applicable in an environment that meets the following assumption.
Assumption 2 (Observable non-stationary set O). Assume that the non-stationarity of M_{t_k} is fully characterized by a non-stationary parameter o_{t_k} ∈ O. Assume also that the agent observes a noisy non-stationary parameter ō_{t_k} at the end of episode k ∈ [K] (at time t_k).

It is worth noting that Assumption 2 is mild, as prior research in non-stationary RL has proposed techniques to estimate o^{(k)} through latent factor identification methods [4, 13-16], and our framework accommodates the incorporation of those works for the estimation of o^{(k)}. Based on Assumption 2, we define the MDP forecaster function g ∘ f below.

Definition 2 (MDP forecaster g ∘ f). Consider two function classes F and G such that F ⊆ {f : O^w → O} and G ⊆ {g : S × A × O → ℝ × Δ(S)}, where w ∈ ℕ. Then, for f^{(k)} ∈ F and g^{(k)} ∈ G, we define the MDP forecaster at time t_k as (g ∘ f)^{(k)} : O^w × S × A → ℝ × Δ(S).

The function f^{(k)}, acting as a non-stationarity forecaster, predicts the non-stationary parameter ô^{(k+1)} of time t_{k+1} based on the last w observations given by the set {ō}^{(k−w+1∼k)}, i.e., ô^{(k+1)} = f^{(k)}({ō}^{(k−w+1∼k)}). The agent can determine the number of historical observations used, denoted w, by leveraging information from the environment (Section 4). Then, the function g^{(k)}, acting as a model predictor, predicts a reward R̂^{(k+1)}(s,a) and a transition probability P̂^{(k+1)}(·|s,a) for time t_{k+1}, i.e., (R̂^{(k+1)}, P̂^{(k+1)}) = g^{(k)}(s, a, ô^{(k+1)}). Finally, OPT_π generates the estimated future MDP M̂^{(k+1)} = ⟨S, A, H, P̂^{(k+1)}, R̂^{(k+1)}, γ⟩ associated with time t_{k+1}.

3.2 Finding a future optimal policy

Now, consider an arbitrary RL algorithm provided by the user to obtain an optimal policy from the model M̂^{(k+1)}. For a given time sequence {t}_1^K, OPT_π finds a near-optimal future policy in two steps: (1) observe and forecast, (2) optimize using the future MDP model.

(1) Observe and forecast. At time t_k, the agent executes episode k in the environment M^{(k)}, completes its trajectory τ^{(k)}, and observes the noisy non-stationary parameter ō^{(k)} (Assumption 2). The algorithm then updates the function f^{(k)} based on the last w observed parameters, and the function g^{(k)} with input from all previous trajectories. Following these updates, the MDP forecaster at time t_k predicts P̂^{(k+1)} and R̂^{(k+1)}, thus creating the MDP model M̂^{(k+1)} for time t_{k+1}.

(2) Optimize using the future MDP model. Until time t_{k+1}, the agent continually updates the policy within the estimated future MDP M̂^{(k+1)} for a given duration Δπ. Specifically, the agent rolls out synthetic trajectories τ̂^{(k+1)} in M̂^{(k+1)} and utilizes any policy update algorithm to obtain a policy π̂^{(k+1)}. After the duration Δπ, the agent stops the training by time t_{k+1} and moves to the next episode M^{(k+1)} with policy π̂^{(k+1)}. We elaborate on the above procedure in Algorithm 1 given in Appendix F.1.

4 Time optimizer

4.1 Theoretical analysis

We now present our main theoretical contribution, which concerns the time optimizer (OPT_t): computing a suboptimal policy training time Δπ (the agent tempo). Our theoretical analysis starts with specifying the components of the OPT_π optimizer, and we refer to the resulting instance as ProST-T (the suffix -T stands for an instance in the tabular setting). We employ the Natural Policy Gradient (NPG) with entropy regularization [17] as the policy update algorithm in OPT_π. We denote the entropy regularization coefficient by τ, the learning rate by η, the policy evaluation approximation gap arising from finite samples by δ, and the past reference length for the forecaster f by w.
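As a concrete illustration of the observe-forecast-optimize loop of OPT_π (Section 3.2), the following sketch instantiates the forecaster f with a simple sliding-window average and the policy update with a placeholder that runs for Δπ iterations; every function name and the toy model predictor g are illustrative assumptions rather than the paper's Algorithm 1.

```python
import numpy as np

def f_forecast(o_window):
    """Non-stationarity forecaster f: predict o^(k+1) from the last w noisy observations.
    Here a simple average stands in for a learned forecaster."""
    return float(np.mean(o_window))

def g_predict(o_hat, n_states, n_actions):
    """Model predictor g: map the forecasted parameter to (R_hat, P_hat).
    Toy choice: the parameter shifts the reward; transitions are uniform."""
    R_hat = o_hat * np.ones((n_states, n_actions))
    P_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    return R_hat, P_hat

def policy_update(policy, R_hat, P_hat, n_iters):
    """Placeholder for the user-supplied policy optimizer (e.g., NPG or SAC),
    run for Delta_pi iterations (the agent tempo)."""
    for _ in range(n_iters):
        policy = 0.9 * policy + 0.1 * np.exp(R_hat)   # illustrative soft improvement
        policy /= policy.sum(axis=1, keepdims=True)
    return policy

def observe(t_k):
    """Stand-in for executing episode k in M_{t_k} and returning the noisy o-bar^(k)."""
    return np.sin(0.3 * t_k) + 0.01 * np.random.randn()

n_states, n_actions, w, delta_pi = 2, 2, 3, 4
policy = np.full((n_states, n_actions), 1.0 / n_actions)
history = []
for t_k in (1, 5, 9, 13, 17):                          # {t}_1^K with Delta_t = 4
    history.append(observe(t_k))                        # (1) observe o-bar^(k) ...
    o_hat = f_forecast(history[-w:])                    #     ... and forecast o-hat^(k+1)
    R_hat, P_hat = g_predict(o_hat, n_states, n_actions)  # build M-hat^(k+1)
    policy = policy_update(policy, R_hat, P_hat, delta_pi)  # (2) optimize for Delta_pi
print(policy)
```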
Without loss of generality, we assume that each policy iteration takes one second. The theoretical analysis is conducted in a tabular environment, allowing us to relax Assumption 2: one can estimate the non-stationary parameters by counting the visitations of state-action pairs at time t_k, denoted n^{(k)}(s,a), rather than observing them. Additionally, we incorporate an exploration bonus term at time t_k into R̂^{(k+1)}, denoted Γ_w^{(k)}(s,a), which is proportional to (Σ_{τ=k−w+1}^{k} n^{(τ)}(s,a))^{−1/2} and aims to promote the exploration of states and actions that are visited infrequently. We compute Δπ* by minimizing an upper bound on ProST-T's dynamic regret. The dynamic regret of ProST-T is characterized by the model prediction error, which measures the MDP forecaster's error by quantifying the difference between M̂^{(k+1)} and M^{(k+1)} through a Bellman equation.

Definition 3 (Model prediction error). At time t_k, the MDP forecaster predicts a model M̂^{(k+1)}, and we then obtain a near-optimal policy π̂^{(k+1)} based on M̂^{(k+1)}. For each pair (s,a), we denote the state value function and the state-action value function of π̂^{(k+1)} in M̂^{(k+1)} at step h ∈ [H] by V̂_h^{(k+1)}(s) and Q̂_h^{(k+1)}(s,a), respectively. We also denote the model prediction error associated with time t_{k+1}, calculated at time t_k, by ι_h^{(k+1)}(s,a), which is defined as

ι_h^{(k+1)}(s,a) = (R^{(k+1)} + γ P^{(k+1)} V̂_{h+1}^{(k+1)} − Q̂_h^{(k+1)})(s,a).

We now derive an upper bound on the ProST-T dynamic regret. We expect the upper bound to be controlled by two factors: the error of the MDP forecaster's prediction of the future MDP model, and the error of the NPG algorithm due to approximating the optimal policy within an estimated future MDP model. This insight is articulated in the next theorem.

Theorem 1 (ProST-T dynamic regret R). Let ι_H^K = Σ_{k=1}^{K−1} Σ_{h=0}^{H−1} ι_h^{(k+1)}(s_h^{(k+1)}, a_h^{(k+1)}) and ι_∞^K = Σ_{k=1}^{K−1} ‖ι^{(k+1)}‖_∞, where ι_H^K is a data-dependent error. For a given p ∈ (0,1), the dynamic regret of the forecasted policies {π̂^{(k+1)}}_1^{K−1} of ProST-T is upper bounded with probability at least 1 − p/2 as follows:

R({π̂^{(k+1)}}_1^{K−1}, K) ≤ R_I + R_II,

where R_I = ι_∞^K/(1−γ) − ι_H^K + C_p √(K−1), R_II = C_II[Δπ] (K−1), and C_p, C_II[Δπ] are functions of p and Δπ, respectively.

Specifically, the upper bound is composed of two terms: R_I, which originates from the MDP forecaster error between M̂^{(k+1)} and M^{(k+1)}, and R_II, which arises from the suboptimality gap between π^{*,(k+1)} and π̂^{(k+1)}. Theorem 1 demonstrates that a prudent construction of the MDP forecaster, which controls the model prediction errors, and the selection of the agent tempo Δπ are significant in guaranteeing sublinear rates for R_I and R_II. To understand the role of the environment tempo in R_I, observe that the MDP forecaster uses w previous observations, which inherently encapsulate the environment tempo. We expect the model prediction errors, at least in part, to be controlled by the environment tempo B(Δπ), so that a trade-off between the two tempos can be framed as the trade-off between R_I and R_II. Hence, it is desirable to minimize the upper bound with respect to Δπ to obtain a solution, denoted Δπ*, that strikes a balance between R_I and R_II.

4.1.1 Analysis of R_II

A direct analysis of the upper bound R_I + R_II is difficult since its dependence on K is not explicit. To address this issue, we recall that an optimal Δπ should be a natural number that guarantees the sublinearity of both R_I and R_II with respect to the total number of episodes K.
We first compute a set N_II ⊆ ℕ that contains those values of Δπ that guarantee the sublinearity of R_II, and then compute a set N_I ⊆ ℕ that guarantees the sublinearity of R_I. Finally, we solve for Δπ* in the common set N_I ∩ N_II.

Proposition 1 (Δπ bounds for sublinear R_II). The total number of steps H is given by the MDP. For a number ϵ > 0 such that H = Ω(log((r̂_max ∨ r_max)/ϵ)), we choose δ, τ, η to satisfy δ = O(ϵ), τ = Ω(ϵ/log|A|), and η ≤ (1−γ)/τ, where r̂_max and r_max are the maximum reward of the forecasted model and the maximum reward of the environment, respectively. Define N_II = {n ∈ ℕ | n > (1/(ητ)) log(C₁(γ+2)/ϵ)}, where C₁ is a constant. Then R_II ≤ 4ϵ(K−1) for all Δπ ∈ N_II.

As a by-product of Proposition 1, the sublinearity of R_II can be realized by choosing ϵ = O((K−1)^{α−1}) for any α ∈ [0,1), which suggests that a tighter upper bound on R_II requires a smaller ϵ and subsequently a larger Δπ ∈ N_II. The hyperparameter conditions in Proposition 1 can be found in Lemmas 1 and 2 in Appendix D.3.

4.1.2 Analysis of R_I

We now relate R_I to the environment tempo B(Δπ) using the well-established non-stationary adaptation technique of the Sliding-Window regularized Least-Squares Estimator (SW-LSE) as the MDP forecaster [18-20]. The tractability of the SW-LSE algorithm allows us to upper-bound the model prediction errors ι_H^K and ι_∞^K by the environment tempo extracted from the past w observed trajectories, leading to a sublinear R_I as demonstrated in the following theorem.

Theorem 2 (Dynamic regret R_I with f = SW-LSE). For a given p ∈ (0,1), if the exploration bonus constant β and the regularization parameter λ satisfy β = Ω(|S| H log(H/p)) and λ ≥ 1, then R_I is bounded with probability at least 1 − p as follows:

R_I ≤ C_I[B(Δπ)] · w + C_k √((1/w) log(1 + Hw)),

where C_I[B(Δπ)] = (1/(1−γ) + H) B_r(Δπ) + (1 + H r̂_max) γ/(1−γ) B_p(Δπ), and C_k is a constant on the order of O(K).

For a brief sketch of how SW-LSE makes the environment tempo appear in the upper bound, we note that the model prediction errors are upper-bounded by two forecaster errors, namely that of P̂^{(k+1)} against P^{(k+1)} and that of R̂^{(k+1)} against R^{(k+1)}, along with the visitation count n^{(k)}(s,a). The SW-LSE algorithm then provides a solution (P̂^{(k+1)}, R̂^{(k+1)}) in closed form as a linear combination of the past w estimated values {P̄, R̄}^{(k−w+1∼k)}. Finally, employing the Cauchy-Schwarz and triangle inequalities, we show that the two forecasting errors are upper-bounded by the environment tempo. As the final step before obtaining a suboptimal Δπ, we compute N_I, which guarantees the sublinearity of R_I.

Proposition 2 (Δπ bounds for sublinear R_I). Denote by B(1) the environment tempo when Δπ = 1, which is a summation over all time steps. Assume that the environment satisfies B_r(1) + B_p(1) r̂_max/(1−γ) = o(K), and choose w = O((K−1)^{2/3}/(C_I[B(Δπ)])^{2/3}). Define the set N_I = {n ∈ ℕ | n < K}. Then R_I is upper-bounded as R_I = O(C_I[B(Δπ)]^{1/3} (K−1)^{2/3} log((K−1)/C_I[B(Δπ)])) and satisfies a sublinear upper bound, provided that Δπ ∈ N_I.

The upper bound on the environment tempo B(1) in Proposition 2 is aligned with our expectation that dedicating an excessively long time to a single iteration may not allow for an effective policy approximation, thereby hindering the achievement of a sublinear dynamic regret. Furthermore, our insight that a larger environment tempo prompts the MDP forecaster to consider a shorter past reference length, aiming to mitigate forecasting uncertainty, is consistent with the condition on w stated in Proposition 2.
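As a quick numerical illustration of the window-size rule in Proposition 2 (with all hidden constants set to 1, an assumption made purely for illustration), the snippet below shows how the chosen window w and the order of the R_I bound scale as the environment tempo term C_I[B(Δπ)] grows.

```python
import math

def window_and_regret(K, C_I):
    """Illustrative only: w = (K-1)^(2/3) / C_I^(2/3) and the order of the R_I bound
    from Proposition 2, with all hidden constants taken to be 1."""
    w = max(1, round((K - 1) ** (2 / 3) / C_I ** (2 / 3)))
    r_I = C_I ** (1 / 3) * (K - 1) ** (2 / 3) * math.log((K - 1) / C_I)
    return w, r_I

K = 1000
for C_I in (0.5, 2.0, 8.0):   # faster environment tempo -> larger C_I[B(Delta_pi)]
    w, r_I = window_and_regret(K, C_I)
    print(f"C_I={C_I:4.1f}  ->  window w={w:4d},  R_I bound ~ {r_I:9.1f}")
# A faster-changing environment (larger C_I) shrinks the sliding window w,
# while the bound stays sublinear in K (order K^(2/3) up to logarithmic factors).
```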
4.1.3 Suboptimal tempo Δπ*

So far, we have shown that an upper bound on the ProST dynamic regret is composed of two terms, R_I and R_II, which are characterized by the environment tempo and the agent tempo, respectively. Now, we claim that a suboptimal tempo that minimizes ProST's dynamic regret can be obtained as the optimal solution

Δπ* = argmin_{Δπ ∈ N_I ∩ N_II} (R_I^max + R_II^max),

where R_I^max and R_II^max denote the upper bounds on R_I and R_II. We show that Δπ* strikes a balance between the environment tempo and the agent tempo, since R_I^max is a non-decreasing function of Δπ and R_II^max is a non-increasing function of Δπ. Theorem 3 shows that the suboptimal tempo Δπ* depends on the environment's drifting constants introduced in Assumption 1.

Theorem 3 (Suboptimal tempo Δπ*). Let k_Env = (α_r ∨ α_p)² C_I[B(1)] and k_Agent = log(1/(1−ητ)) C₁(K−1)(γ+2). Consider three cases: Case 1: α_r ∨ α_p = 0; Case 2: α_r ∨ α_p = 1; Case 3: 0 < α_r ∨ α_p < 1 or α_r ∨ α_p > 1. Then Δπ* depends on the environment's drifting constants as follows. Case 1: Δπ* = T. Case 2: Δπ* = log_{1−ητ}(k_Env/k_Agent) + 1. Case 3: Δπ* = exp(W[log(1−ητ)/((α_r ∨ α_p) − 1)]), provided that the parameters are chosen so that k_Agent = (1−ητ) k_Env.

4.2 Improving the MDP forecaster

Determining a suboptimal tempo by minimizing an upper bound on R_I + R_II can be improved by using a tighter upper bound. In Proposition 1, we focused on the Q-approximation gap δ to provide a justifiable upper bound on R_I + R_II. It is important to note that the factor δ arises not only from the finite sample trajectories, as discussed in [21], but also from the forecasting error between M̂^{(k+1)} and M^{(k+1)}. It is clear that the MDP forecaster establishes a lower bound on δ, denoted δ_min, which in turn sets a lower bound on ϵ and consequently on R_I. This observation highlights that the MDP forecaster serves as a common factor that controls both R_I and R_II, and a further investigation into improving the accuracy of the forecaster is necessary for a better bound on R_I + R_II.

Our approach to devising a precise MDP forecaster is that, instead of selecting the past reference length w as indicated in Proposition 2, we set w = k, implying the utilization of all past observations. We then solve an additional optimization problem, resulting in a tighter bound on R_I. We propose a method that adaptively assigns different weights q ∈ ℝ₊^k to the previously observed non-stationary parameters up to time t_k, which reduces the burden of choosing w. Hence, we further analyze R_I through the Weighted regularized Least-Squares Estimator (W-LSE) [22]. Unlike SW-LSE, W-LSE does not necessitate a predefined selection of w; instead, it engages in a joint optimization over the data weights q and the future model (P̂^{(k+1)}, R̂^{(k+1)}). To this end, we define the forecasting reward model error as ε_r^k(s,a) = (R̂^{(k+1)} − R^{(k+1)})(s,a) and the forecasting transition probability model error as ε_p^k(s,a) = ‖(P̂^{(k+1)} − P^{(k+1)})(·|s,a)‖_1.

Theorem 4 (R_I upper bound with f = W-LSE). By setting the exploration bonus Γ^{(k)}(s,a) = (1/2) ε_r^k(s,a) + (γ r̂_max/(2(1−γ))) ε_p^k(s,a), it holds that

R_I ≤ (4H + (2γ|S|/(1−γ)) (1/(1−γ) + H)) ((1/2) Σ_{k=1}^{K−1} ε_r^k(s,a) + (γ r̂_max/(2(1−γ))) Σ_{k=1}^{K−1} ε_p^k(s,a)).

Remark 1 (Tighter R_I upper bound with f = W-LSE). If the optimization problem of W-LSE is feasible, then the optimal data weight q* provides tighter bounds on ε_r^k and ε_p^k in comparison to SW-LSE, consequently leading to a tighter upper bound on R_I. We prove in Lemmas 4 and 6 in Appendix D.3 that ι_∞^K and ι_H^K are upper-bounded in terms of ε_r^k and ε_p^k.
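The closed-form cases of Theorem 3 above are straightforward to evaluate numerically; the sketch below does so with `scipy.special.lambertw`, under the reconstructed constants stated in the theorem (the specific values of η, τ, α_r, α_p, k_Env, and k_Agent used here are illustrative assumptions, not recommendations from the paper).

```python
import math
from scipy.special import lambertw

def suboptimal_tempo(alpha_r, alpha_p, eta, tau, k_env, k_agent, T):
    """Evaluate Delta_pi* from the three cases of Theorem 3 (illustrative constants)."""
    alpha = max(alpha_r, alpha_p)                       # alpha_r v alpha_p
    if alpha == 0:                                      # Case 1: budgets do not drift
        return float(T)
    if alpha == 1:                                      # Case 2
        return math.log(k_env / k_agent, 1 - eta * tau) + 1
    # Case 3: 0 < alpha < 1 or alpha > 1, assuming k_agent = (1 - eta*tau) * k_env
    z = math.log(1 - eta * tau) / (alpha - 1)
    return float(math.exp(lambertw(z).real))            # principal real branch of W

eta, tau, T = 0.5, 0.2, 100
print(suboptimal_tempo(0.0, 0.0, eta, tau, k_env=1.0, k_agent=1.0, T=T))   # Case 1
print(suboptimal_tempo(1.0, 0.5, eta, tau, k_env=0.2, k_agent=2.0, T=T))   # Case 2
print(suboptimal_tempo(0.5, 0.3, eta, tau, k_env=1.0, k_agent=0.9, T=T))   # Case 3
```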
4.3 ProST-G

The theoretical analysis outlined above motivates us to empirically investigate two key points: first, the existence of an optimal training time; second, the contribution of the MDP forecaster to the ProST framework's overall performance. To address these questions, we propose a practical instance, named ProST-G, which particularly extends the investigation of Section 4.2. ProST-G optimizes the policy with the soft actor-critic (SAC) algorithm [23], uses the autoregressive integrated moving average (ARIMA) model as the proactive forecaster f, and uses a bootstrap ensemble of dynamics models, each a probabilistic neural network, as the model predictor g. We further discuss the specific details of ProST-G in Appendix F.3 and in Algorithm 3.

5 Experiments

We evaluate ProST-G against four baselines in three Mujoco environments, each with five different non-stationary speeds and two non-stationary datasets.

(1) Environments: non-stationary desired posture. We make the rewards in the three environments non-stationary by altering the agent's desired directions. The forward reward R^f_t changes as R^f_t = o_t · ˢR^f_t, where ˢR^f is the original reward from the Mujoco environment. The non-stationary parameter o_k is generated from a sine function with five different speeds and from the real datasets A and B. We then measure the time-elapsing variation budget as Σ_{k=1}^{K−1} |o_{k+1} − o_k|. Further details of the environment settings can be found in Appendix D.1.1.

(2) Benchmark methods. Four baselines are chosen to empirically address our second question, the significance of the forecaster. MBPO is a state-of-the-art model-based policy optimization algorithm [24]. Pro-OLS is a policy gradient algorithm that predicts the future performance and optimizes the predicted performance of the future episode [7]. ONPG is an adaptive algorithm that performs purely online optimization by fine-tuning the existing policy using only the trajectory observed online [8]. FTRL is an adaptive algorithm that performs follow-the-regularized-leader optimization by maximizing the performance on all previous trajectories [9].

6 Discussions

6.1 Performance comparison

The outcomes of the experiments are presented in Table 1. The table summarizes the average return over the last 10 episodes of the training procedure; the complete training results are illustrated in Appendix E.3. In most cases, ProST-G outperforms MBPO in terms of rewards, highlighting the adaptability of the ProST framework to dynamic environments. Furthermore, except for datasets A and B, ProST-G consistently outperforms the other three baselines. This supports our motivation of using the proactive model-based method for higher adaptability in non-stationary environments compared to state-of-the-art model-free algorithms (Pro-OLS, ONPG, FTRL). We elaborate on the training details in Appendix E.2.
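For reference, the snippet below shows one way a sine-generated non-stationary parameter and its measured variation budget Σ_{k=1}^{K−1} |o_{k+1} − o_k| could be computed; the episode count and exact sine schedule here are illustrative assumptions (the paper's settings are in Appendix D.1.1), so the printed values are not meant to reproduce the B column of Table 1 below.

```python
import numpy as np

def variation_budget(o):
    """Measured time-elapsing variation budget: sum_k |o_{k+1} - o_k|."""
    return float(np.sum(np.abs(np.diff(o))))

K = 200  # illustrative number of episodes
for speed in range(1, 6):
    k = np.arange(1, K + 1)
    o = np.sin(2 * np.pi * speed * k / K)   # illustrative sine schedule for o_k
    print(f"speed={speed}:  B = {variation_budget(o):.2f}")
# Faster speeds change o_k more per episode, so the measured budget grows,
# mirroring the ordering of the B column in Table 1.
```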
Table 1: Average reward returns (mean over the last 10 training episodes), grouped by environment.

Swimmer-v2

| Speed | B(G) | Pro-OLS | ONPG | FTML | MBPO | ProST-G |
|---|---|---|---|---|---|---|
| 1 | 16.14 | -0.40 | -0.26 | -0.08 | -0.08 | 0.57 |
| 2 | 32.15 | 0.20 | -0.12 | 0.14 | -0.01 | 1.04 |
| 3 | 47.86 | -0.13 | 0.05 | -0.15 | -0.64 | 1.52 |
| 4 | 63.14 | -0.22 | -0.09 | -0.11 | -0.04 | 2.01 |
| 5 | 77.88 | -0.23 | -0.42 | -0.27 | 0.10 | 2.81 |
| A | 8.34 | 1.46 | 2.10 | 2.37 | -0.08 | 0.57 |
| B | 4.68 | 1.79 | -0.72 | -1.20 | 0.19 | 0.20 |

Halfcheetah-v2

| Speed | B(G) | Pro-OLS | ONPG | FTML | MBPO | ProST-G |
|---|---|---|---|---|---|---|
| 1 | 16.14 | -83.79 | -85.33 | -85.17 | -24.89 | -19.69 |
| 2 | 32.15 | -83.79 | -85.63 | -86.46 | -22.19 | -20.21 |
| 3 | 47.86 | -83.27 | -85.97 | -86.26 | -21.65 | -21.04 |
| 4 | 63.14 | -82.92 | -84.37 | -85.11 | -21.40 | -19.55 |
| 5 | 77.88 | -84.73 | -85.42 | -87.02 | -20.50 | -20.52 |
| A | 8.34 | -76.67 | -85.38 | -83.83 | -40.67 | 83.74 |
| B | 4.68 | -80.46 | -86.96 | -85.59 | -29.28 | 76.56 |

Hopper-v2

| Speed | B(G) | Pro-OLS | ONPG | FTML | MBPO | ProST-G |
|---|---|---|---|---|---|---|
| 1 | 16.14 | 98.38 | 95.39 | 97.18 | 92.88 | 92.77 |
| 2 | 32.15 | 98.78 | 97.34 | 99.02 | 96.55 | 98.13 |
| 3 | 47.86 | 97.70 | 98.18 | 98.60 | 95.08 | 100.42 |
| 4 | 63.14 | 98.89 | 97.43 | 97.94 | 97.86 | 100.68 |
| 5 | 77.88 | 97.63 | 99.64 | 99.40 | 96.86 | 102.48 |
| A | 8.34 | 104.72 | 118.97 | 115.21 | 100.29 | 111.36 |
| B | 4.68 | 80.83 | 131.23 | 110.09 | 100.29 | 127.74 |

6.2 Ablation study

An ablation study was conducted on the two aforementioned questions. The following results support our observations in Section 4.2 and provide strong grounds for Theorem 3.

Figure 3: (a) Optimal Δπ; (b) different forecasters f (ARIMA, SA); (c) the mean squared error (MSE) model loss of ProST-G with four different forecasters (ARIMA and three SA) and MBPO. The x-axis in each figure shows the episodes.

Suboptimal Δπ. The experiments are performed over five different policy training times Δπ ∈ {1, 2, 3, 4, 5}, aligned with SAC's number of gradient steps G ∈ {38, 76, 114, 152, 190}, under a fixed environment speed. Differently from our theoretical analysis, we set Δt = 1 with G = 38. We generate o_k = sin(2π Δπ k / 37), which satisfies Assumption 1 (see Appendix E.1). The shaded areas of Figures 3(a), (b), and (c) are the 95% confidence areas over three different noise bounds of 0.01, 0.02, and 0.03 on o_k. Figure 3(a) shows that Δπ = 4 is close to the optimal choice of G among the five candidates.

Functions f, g. We investigate the effect of the forecaster f's accuracy on the framework using two distinct functions, ARIMA and a simple average (SA) model, each tested with three different values of w. Figure 3(b) shows the average rewards of the SA model with w ∈ {3, 5, 7} and of the ARIMA model (four solid lines). The shaded area is the 95% confidence area over four different speeds {1, 2, 3, 4}. Figure 3(c) shows the corresponding model error. We also investigate the effect of the model predictor g by comparing MBPO (reactive model) and ProST-G with f = ARIMA (proactive model) in Figure 3(c). The high returns of ProST-G with f = ARIMA, compared to those of MBPO, empirically support that the forecasting component of the ProST framework can provide satisfactory adaptability to the baseline algorithm it is equipped with. Figures 3(b) and 3(c) also provide empirical evidence that the accuracy of f is contingent on the sliding-window size, thereby impacting the model accuracy and subsequently influencing the agent's performance.

7 Conclusion

This work offers the first study of the important issue of time synchronization in non-stationary RL. To this end, we introduce the concept of the tempo of adaptation in non-stationary RL and obtain a suboptimal training time. We propose a Proactively Synchronizing Tempo (ProST) framework, together with two specific instances, ProST-T and ProST-G. The proposed method adjusts the agent's tempo to match the tempo of the environment in order to handle non-stationarity, supported by both theoretical analysis and empirical evidence.
The ProST framework provides a new avenue for implementing reinforcement learning in the real world by incorporating the concept of adaptation tempo. As future work, it is important to generalize the proposed framework to learn a policy with safety guarantees in non-stationary RL by considering the adaptation tempo of constraint violations [25, 26]. Another generalization is to introduce an alternative dynamic regret metric, enabling a fair performance comparison among agents even when they have varying numbers of total episodes. A further direction is to find an optimal tempo for distribution correction in offline non-stationary RL, specifically how to adjust the relabeling function applied to offline data in a time-varying environment depending on the tempo of the environment [27, 28].

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[2] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, L. Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

[3] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. In International Conference on Machine Learning, 2019.

[4] Luisa Zintgraf, Sebastian Schulze, Cong Lu, Leo Feng, Maximilian Igl, Kyriacos Shiarlis, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: Variational Bayes-adaptive deep RL via meta-learning. Journal of Machine Learning Research, 22(289):1–39, 2021.

[5] Zhuangdi Zhu, Kaixiang Lin, Anil K. Jain, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):13344–13362, 2023.

[6] Sindhu Padakandla. A survey of reinforcement learning algorithms for dynamically varying environments. ACM Computing Surveys (CSUR), 54(6):1–25, 2021.

[7] Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, and Philip Thomas. Optimizing for the future in non-stationary MDPs. In International Conference on Machine Learning, pages 1414–1425. PMLR, 2020.

[8] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations, 2018.

[9] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930. PMLR, 2019.

[10] Yuhao Ding, Ming Jin, and Javad Lavaei. Non-stationary risk-sensitive reinforcement learning: Near-optimal dynamic regret, adaptive detection, and separation design. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):7405–7413, 2022.

[11] Steven J. Bradtke. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

[12] Yonatan Gur, Assaf J. Zeevi, and Omar Besbes. Stochastic multi-armed-bandit problem with non-stationary rewards. In NIPS, 2014.
[13] Xiaoyu Chen, Xiangming Zhu, Yufeng Zheng, Pushi Zhang, Li Zhao, Wenxue Cheng, Peng Cheng, Yongqiang Xiong, Tao Qin, Jianyu Chen, and Tie-Yan Liu. An adaptive deep RL method for non-stationary environments with piecewise stable context. In Advances in Neural Information Processing Systems, volume 35, pages 35449–35461, 2022.

[14] Biwei Huang, Fan Feng, Chaochao Lu, Sara Magliacane, and Kun Zhang. AdaRL: What, where, and how to adapt in transfer reinforcement learning. In International Conference on Learning Representations, 2022.

[15] Fan Feng, Biwei Huang, Kun Zhang, and Sara Magliacane. Factored adaptation for non-stationary reinforcement learning. In Advances in Neural Information Processing Systems, volume 35, pages 31957–31971, 2022.

[16] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34:24523–24534, 2021.

[17] Sham M. Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.

[18] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under nonstationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 2019.

[19] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Hedging the drift: Learning to optimize under nonstationarity. Management Science, 68(3):1696–1713, 2022.

[20] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Reinforcement learning for nonstationary Markov decision processes: The blessing of (more) optimism. In International Conference on Machine Learning, pages 1843–1854. PMLR, 2020.

[21] Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578, 2022.

[22] Vitaly Kuznetsov and Mehryar Mohri. Theory and algorithms for forecasting time series. arXiv preprint arXiv:1803.05814, 2018.

[23] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.

[24] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 32, 2019.

[25] Ming Jin and Javad Lavaei. Stability-certified reinforcement learning: A control-theoretic perspective. IEEE Access, 8:229086–229100, 2020.

[26] Samuel Pfrommer, Tanmay Gautam, Alec Zhou, and Somayeh Sojoudi. Safe reinforcement learning with chance-constrained model predictive control. In Learning for Dynamics and Control Conference, pages 291–303. PMLR, 2022.

[27] Jongmin Lee, Cosmin Paduraru, Daniel J. Mankowitz, Nicolas Heess, Doina Precup, Kee-Eung Kim, and Arthur Guez. COptiDICE: Offline constrained reinforcement learning via stationary distribution correction estimation. In International Conference on Learning Representations, 2022.

[28] Jongmin Lee, Wonseok Jeon, Byungjun Lee, Joelle Pineau, and Kee-Eung Kim. OptiDICE: Offline policy optimization via stationary distribution correction estimation. In International Conference on Machine Learning. PMLR, 2021.

[29] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
[30] Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Başar. Model-free non-stationary RL: Near-optimal regret and applications in multi-agent RL and inventory control. arXiv preprint arXiv:2010.03161, 2020.

[31] Yuhao Ding and Javad Lavaei. Provably efficient primal-dual reinforcement learning for CMDPs with non-stationary objectives and constraints. AAAI'23/IAAI'23/EAAI'23. AAAI Press, 2023. ISBN 978-1-57735-880-0.

[32] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Kirkeby Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

[33] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

[34] Wen Sun, Geoffrey J. Gordon, Byron Boots, and J. Bagnell. Dual policy iteration. Advances in Neural Information Processing Systems, 31, 2018.

[35] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, 2019.

[36] Yingjie Fei, Zhuoran Yang, Zhaoran Wang, and Qiaomin Xie. Dynamic regret of policy optimization in non-stationary environments. Advances in Neural Information Processing Systems, 33:6743–6754, 2020.

[37] Yash Chandak, Scott Jordan, Georgios Theocharous, Martha White, and Philip S. Thomas. Towards safe policy improvement for non-stationary MDPs. Advances in Neural Information Processing Systems, 33:9156–9168, 2020.

[38] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431–4506, 2021.

[39] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.