# Fractal Landscapes in Policy Optimization

Tao Wang (UC San Diego, taw003@ucsd.edu), Sylvia Herbert (UC San Diego, sherbert@ucsd.edu), Sicun Gao (UC San Diego, sicung@ucsd.edu)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

**Abstract.** Policy gradient lies at the core of deep reinforcement learning (RL) in continuous domains. Despite much success, it is often observed in practice that RL training with policy gradient can fail for many reasons, even on standard control problems with known solutions. We propose a framework for understanding one inherent limitation of the policy gradient approach: the optimization landscape in the policy space can be extremely non-smooth or fractal for certain classes of MDPs, such that there is no gradient to estimate in the first place. We draw on techniques from chaos theory and non-smooth analysis, and analyze the maximal Lyapunov exponents and Hölder exponents of the policy optimization objectives. Moreover, we develop a practical method that estimates the local smoothness of the objective function from samples, so as to identify when the training process has encountered fractal landscapes. We show experiments that illustrate how some failure cases of policy optimization can be explained by such fractal landscapes.

## 1 Introduction

Deep reinforcement learning has achieved much success in various applications [23, 30, 38], but it also often fails, especially in continuous spaces, on control problems that other methods can readily solve. The understanding of such failure cases is still limited. For instance, the training process of reinforcement learning is unstable and the learning curve can fluctuate during training in ways that are hard to predict. The probability of obtaining satisfactory policies can also be inherently low in reward-sparse or highly nonlinear control tasks. Existing analysis of the failures focuses on limitations of the sampling and optimization algorithms, such as function approximation errors [35, 39], difficulty in data collection [34], and aggressive updates in the policy space [28]. There has not been much study of potentially deeper causes of failure that may be inherent in the formulation of policy optimization problems.

Motivated by the common observation that small updates in the policy parameters can significantly change the performance, we analyze the smoothness of the optimization landscapes in policy optimization. Drawing on chaos theory, we introduce the concept of the maximal Lyapunov exponent (MLE) [17] in the RL setting to measure the exponential rate of trajectory divergence in an MDP. It may seem contradictory that a trajectory of a chaotic system can be both exponentially divergent and uniformly bounded at the same time, and we will show that these two conflicting facts combine to yield the fractal structure of the optimization landscape. Intuitively, the objective function is non-differentiable when the rate of trajectory divergence exceeds the decay rate of the discount factor. Furthermore, this finding indicates that the fluctuations observed in the loss curve are not just due to numerical or sampling error, but rather reflect intrinsic properties of the corresponding MDP.

We should emphasize that the fractal landscapes we demonstrate are stronger than various existing results on non-smoothness [2, 7]. Most non-smooth objectives that have been studied still assume local Lipschitz continuity or piecewise smoothness, which implies differentiability almost everywhere (such as f(x) = |x|).
Instead, by showing that the loss landscape of policy optimization can be fractal, we demonstrate the absence of descent directions, which causes the failure of first-order methods in general. Since such behavior is an intrinsic property of the underlying dynamical system, the results show fundamental limitations of policy gradient methods on certain classes of MDPs.

The paper is organized as follows. In Sections 3 and 4, we introduce the preliminaries and develop the theory for deterministic policies. In particular, we show that the optimization landscape can be fractal even when all elements of the MDP are deterministic. Next, we consider stochastic policies and provide an example showing how non-smoothness can still occur without additional assumptions. In Section 5, we turn the theoretical analysis into a practical sampling-based method for estimating the Hölder exponent, which determines whether the optimization objective is differentiable at a specific parameter vector. It can also indicate whether the training process has encountered fractal regions, by comparing the regression slope with a fixed threshold. In Section 6, we show experiments that demonstrate and compare the landscapes of different MDPs.

## 2 Related work

**Policy gradient and Q-learning methods.** Policy gradient methods [33, 41] formulate RL as an optimization problem in the parameter space; many variations, such as natural policy gradient [16], deterministic policy gradient [29], deep deterministic policy gradient [18], trust region policy optimization [27] and proximal policy optimization [28], have been proposed. As all of these algorithms aim to estimate the gradient of the objective function with respect to the policy parameters, they become ill-posed when the objective is non-differentiable, which is the focus of our analysis. Another popular approach to model-free RL is Q-learning, which approximates the Q-function of the policy at each step [22, 40]. With the rise of neural networks, they are employed as function approximators in deep Q-learning algorithms [9, 13, 37]. Since the foundation of Q-learning methods rests on the estimation of value functions, a poor approximation can completely ruin the entire training process. In this paper, we show that the value functions of a certain class of MDPs exhibit significant non-smoothness, making them challenging to represent with existing methods.

**Chaos in machine learning.** Chaotic behaviors due to randomness in the learning dynamics have been reported in other learning problems [6, 21, 25]. For instance, when training recurrent neural networks over long horizons, the outcome behaves like a random walk due to the problems of vanishing and exploding gradients [4]. This served as motivation for [24], which points out that chaotic behavior in finite-horizon model-based reinforcement learning may be caused by long chains of nonlinear computation. A similar observation was made in [31]. In contrast, we show that in RL the objective function is provably smooth if the time horizon is finite and the underlying dynamics is differentiable. We instead focus on the general setting of infinite-horizon MDPs, in which case the objective function can become non-differentiable.

**Loss landscape of policy optimization.**
It has been shown that the objective functions of finite state-space MDPs are smooth [1, 42], which enables the use of gradient-based methods and direct policy search. It also explains why the classical RL algorithms in [32] are provably efficient in finite-space settings. Such smoothness results extend to some continuous state-space MDPs with special structure. For instance, the objective function of the Linear Quadratic Regulator (LQR) problem is almost smooth [10] as long as the cost is finite. Similar results are obtained for the $H_2/H_\infty$ problem [43]. For the robust control problem, although the objective function may not be smooth, it is locally Lipschitz continuous, which implies differentiability almost everywhere and further leads to global convergence of direct policy search [11]. There is still limited theoretical study of the loss landscapes of policy optimization for nonlinear and complex MDPs. We aim to partially address this gap by pointing out that the loss landscape can be highly non-smooth and even fractal, which is far more complex than the previous cases.

## 3 Preliminaries

### 3.1 Dynamical Systems as Markov Decision Processes

We consider Markov Decision Processes (MDPs) that encode continuous control problems for dynamical systems defined by difference equations of the form

$$s_{t+1} = f(s_t, a_t), \qquad (1)$$

where $s_t \in S \subseteq \mathbb{R}^n$ is the state at time $t$, $s_0$ is the initial state, and $a_t \sim \pi_\theta(\cdot\,|s_t) \in A \subseteq \mathbb{R}^m$ is the action taken at time $t$ according to a policy parameterized by $\theta \in \mathbb{R}^N$. We assume that both the state space $S$ and the action space $A$ are compact. The objective function of the RL problem, to be minimized, is the value function $V^{\pi_\theta}$ of the policy $\pi_\theta$:

$$J(\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}\Big[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)\Big], \qquad (2)$$

where $\gamma \in (0, 1)$ is the discount factor and $c(s, a)$ is the cost function. The following assumptions are made throughout the paper:

- (A.1) $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is Lipschitz continuous over any compact domain (i.e., locally Lipschitz continuous);
- (A.2) the cost function $c : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is non-negative and locally Lipschitz continuous everywhere;
- (A.3) the state space is closed under transitions, i.e., for any $(s, a) \in S \times A$, the next state $s' = f(s, a) \in S$.

### 3.2 Policy gradient methods

Policy gradient methods estimate the gradient of the objective $J(\cdot)$ with respect to the policy parameters. A commonly used form is

$$\nabla J(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t, a_t)\big], \qquad (3)$$

where $\pi_\theta(\cdot\,|\,\cdot)$ is a stochastic policy parameterized by $\theta$, $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$ is the advantage function often used for variance reduction, and $Q^{\pi_\theta}(\cdot, \cdot)$ is the Q-value function of $\pi_\theta$. The theoretical convergence guarantee of policy gradient methods is typically established by arguing that the tail term $\gamma^t \nabla_\theta V^{\pi_\theta}(s)$ diminishes as $t$ increases, for any $s \in S$ [33]. For such claims to hold, two assumptions are needed: $\nabla_\theta V^{\pi_\theta}(s)$ exists and is continuous for all $s \in S$, and $\nabla_\theta V^{\pi_\theta}(s)$ is uniformly bounded over $S$. The second assumption is automatically satisfied by the first when $S$ is either finite or compact. However, as we will see in Sections 4 and 6, the existence of $\nabla_\theta V^{\pi_\theta}(\cdot)$ may fail in many continuous MDPs even if $S$ is compact, which challenges the fundamental well-posedness of policy gradient methods.
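To make (3) concrete, the following is a minimal Monte-Carlo sketch of the score-function estimator for a linear-Gaussian policy $\pi_\theta(a|s) = N(\theta^\top s, \sigma^2)$ on a generic environment. The `step` and `cost` callables, the use of the discounted cost-to-go in place of the advantage, and all hyperparameters are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def estimate_policy_gradient(step, cost, theta, s0, gamma=0.99, sigma=0.1,
                             horizon=1000, episodes=64, seed=0):
    """Monte-Carlo estimate of (3) for a linear-Gaussian policy
    pi_theta(a|s) = N(theta @ s, sigma^2), using the discounted cost-to-go
    in place of the advantage (a learned baseline is omitted for brevity)."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta, dtype=float)
    for _ in range(episodes):
        s = np.array(s0, dtype=float)
        scores, costs = [], []
        for _ in range(horizon):
            mean = theta @ s
            a = rng.normal(mean, sigma)
            # score function: d/dtheta log pi_theta(a|s) = (a - theta @ s) * s / sigma^2
            scores.append((a - mean) * s / sigma**2)
            costs.append(cost(s, a))
            s = step(s, a)
        # discounted cost-to-go G_t = sum_{k >= t} gamma^(k - t) * c_k
        G, returns = 0.0, np.zeros(horizon)
        for t in reversed(range(horizon)):
            G = costs[t] + gamma * G
            returns[t] = G
        discounts = gamma ** np.arange(horizon)
        grad += sum(d * g * sc for d, g, sc in zip(discounts, returns, scores))
    return grad / episodes
```

Note that the estimator is well defined for any sampled trajectories; the question raised in this paper is whether the quantity it tries to estimate, $\nabla J(\theta)$, exists at all.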
### 3.3 Maximal Lyapunov Exponents

Behaviors of chaotic systems depend sensitively on their initial conditions. To be precise, consider the system $s_{t+1} = F(s_t)$ with initial state $s_0 \in \mathbb{R}^n$, and suppose that a small perturbation $Z_0$ is made to $s_0$. The divergence from the original trajectory under this perturbation at time $t$, say $Z(t)$, can be estimated by $\|Z(t)\| \approx e^{\lambda t}\|Z_0\|$ for some $\lambda$ called the Lyapunov exponent. For chaotic systems, Lyapunov exponents are typically positive, which implies an exponential divergence rate of nearby trajectories [19]. Since the Lyapunov exponent at a given point may depend on the direction of the perturbation $Z_0$, and we are interested in the largest divergence rate, the maximal Lyapunov exponent (MLE) is formally defined as follows:

**Definition 3.1.** (Maximal Lyapunov exponent) For the dynamical system $s_{t+1} = F(s_t)$, $s_0 \in \mathbb{R}^n$, the maximal Lyapunov exponent $\lambda_{\max}$ at $s_0$ is defined as the largest value such that

$$\lambda_{\max} = \limsup_{t \to \infty}\ \limsup_{\|Z_0\| \to 0}\ \frac{1}{t}\log\frac{\|Z(t)\|}{\|Z_0\|}. \qquad (4)$$

Note that systems with unstable equilibria, not necessarily chaotic ones, can also have positive MLEs.

### 3.4 Fractal Landscapes

The Hausdorff dimension is the most fundamental concept in fractal theory. We first introduce the notions of δ-cover and Hausdorff measure:

**Definition 3.2.** (δ-cover) Let $\{U_i\}$ be a countable collection of sets of diameter at most $\delta$ (i.e., $|U_i| = \sup\{\|x - y\| : x, y \in U_i\} \le \delta$) and let $F \subseteq \mathbb{R}^N$; then $\{U_i\}$ is a δ-cover of $F$ if $F \subseteq \bigcup_{i=1}^{\infty} U_i$.

**Definition 3.3.** (Hausdorff measure) For any $F \subseteq \mathbb{R}^N$ and $s \ge 0$, let
$$H^s_\delta(F) = \inf\Big\{\sum_{i=1}^{\infty} |U_i|^s : \{U_i\} \text{ is a δ-cover of } F\Big\}.$$
Then we call the limit $H^s(F) = \lim_{\delta \to 0} H^s_\delta(F)$ the s-dimensional Hausdorff measure of $F$.

The definition of the Hausdorff dimension follows immediately:

**Definition 3.4.** (Hausdorff dimension) Let $F \subseteq \mathbb{R}^N$ be a subset; then its Hausdorff dimension is
$$\dim_H F = \inf\{s \ge 0 : H^s(F) = 0\} = \sup\{s \ge 0 : H^s(F) = \infty\}.$$

We also introduce the notion of α-Hölder continuity, which extends the concept of Lipschitz continuity:

**Definition 3.5.** (α-Hölder continuity) Let $\alpha > 0$ be a scalar. A function $g : \mathbb{R}^N \to \mathbb{R}$ is α-Hölder continuous at $x \in \mathbb{R}^N$ if there exist $C > 0$ and $\delta > 0$ such that $|g(x) - g(y)| \le C\|x - y\|^{\alpha}$ for all $y \in B(x, \delta)$, where $B(x, \delta)$ denotes the open ball of radius $\delta$ centered at $x$.

The definition reduces to Lipschitz continuity when $\alpha = 1$. A function is not differentiable at a point if its largest Hölder exponent there is less than 1. Just as smoothness is commonly associated with Lipschitz continuity, fractal behavior is closely related to Hölder continuity. In particular, for an open set $F \subseteq \mathbb{R}^k$ and a continuous mapping $\eta : F \to \mathbb{R}^d$ with $d > k$, the image set $\eta(F)$ is fractal when its Hausdorff dimension $\dim_H \eta(F)$ is strictly greater than $k$, which can occur when $\eta$ is α-Hölder continuous with exponent $\alpha < 1$:

**Proposition 3.1.** ([8]) Let $F \subseteq \mathbb{R}^k$ be a subset and suppose that $\eta : F \to \mathbb{R}^d$ is α-Hölder continuous with $\alpha > 0$; then $\dim_H \eta(F) \le \frac{1}{\alpha}\dim_H F$.

This implies that if the objective function is α-Hölder for some $\alpha < 1$, its loss landscape $\mathcal{L}_J = \{(\theta, J(\theta)) \in \mathbb{R}^{N+1} : \theta \in \mathbb{R}^N\}$ can be fractal. Further discussion of the theory of fractals can be found in Appendix C.

## 4 Fractal Landscapes in the Policy Space

In this section, we show that the objective $J(\theta)$ in policy optimization can be non-differentiable when the system has positive MLEs. We first consider the Hölder continuity of $V^{\pi_\theta}(\cdot)$ and $J(\cdot)$ with deterministic policies in Sections 4.1 and 4.2, and then discuss the case of stochastic policies in Section 4.3.

### 4.1 Hölder Exponent of $V^{\pi_\theta}(\cdot)$

We first consider a deterministic policy $\pi_\theta$ that maps states to actions $a = \pi_\theta(s)$ rather than to distributions. Consider a fixed policy parameter $\theta \in \mathbb{R}^N$ such that the MLE of the closed-loop system (1), denoted $\lambda(\theta)$, is greater than $-\log\gamma$. Let $s'_0 \in S$ be another initial state close to $s_0$, i.e., $\delta = \|s'_0 - s_0\| > 0$ is small.
By assumption (A.3) and the compactness of the state space, we can find a constant $M > 0$ such that $\|s_t\| \le M$ and $\|s'_t\| \le M$ for all $t \in \mathbb{N}$, where $\{s_t\}_{t=0}^{\infty}$ and $\{s'_t\}_{t=0}^{\infty}$ are the trajectories starting from $s_0$ and $s'_0$, respectively. Motivated by (4), we further make the following assumptions:

- (A.4) there exists $K_1 > 0$ such that $\|s'_t - s_t\| \le K_1\,\delta\, e^{\lambda(\theta)t}$ for all $t \in \mathbb{N}$ and $\delta = \|s'_0 - s_0\| > 0$;
- (A.5) the policy $\pi : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^m$, $(\theta, s) \mapsto \pi_\theta(s)$, is locally Lipschitz continuous everywhere.

We then have the following theorem, which provides a lower bound on the Hölder exponent; the detailed proof can be found in Appendix B.1.

**Theorem 4.1.** (Non-smoothness of $V^{\pi_\theta}$) Assume (A.1)-(A.5) and that the parameterized policy $\pi_\theta(\cdot)$ is deterministic. Let $\lambda(\theta)$ denote the MLE of (1) under $\pi_\theta$ at $\theta \in \mathbb{R}^N$. Suppose that $\lambda(\theta) > -\log\gamma$; then $V^{\pi_\theta}(\cdot)$ is $\frac{-\log\gamma}{\lambda(\theta)}$-Hölder continuous at $s_0$.

*Proof sketch of Theorem 4.1.* Suppose that $p \in (0, 1]$ is a constant for which we would like to prove that $V^{\pi_\theta}(s)$ is $p$-Hölder continuous at $s = s_0$; here we take $p = \frac{-\log\gamma}{\lambda(\theta)}$. According to Definition 3.5, it suffices to find some $C > 0$ such that $|V^{\pi_\theta}(s'_0) - V^{\pi_\theta}(s_0)| \le C\,\delta^{p}$ when $\delta = \|s_0 - s'_0\| \ll 1$. Consider the relaxed form

$$|V^{\pi_\theta}(s'_0) - V^{\pi_\theta}(s_0)| \le \sum_{t=0}^{\infty}\gamma^t\,|c(s_t, \pi_\theta(s_t)) - c(s'_t, \pi_\theta(s'_t))| \le C\,\delta^{p}. \qquad (5)$$

Now we split the series into three parts, as shown in Figure 1: the sum of the first $T_2$ terms, the sum from $t = T_2 + 1$ to $T_3 - 1$, and the sum from $t = T_3$ to $\infty$. First, applying (A.4) to the sum of the first $T_2$ terms yields

$$\sum_{t=0}^{T_2}\gamma^t\,|c(s_t, \pi_\theta(s_t)) - c(s'_t, \pi_\theta(s'_t))| \ \le\ \frac{e^{(\lambda(\theta)+\log\gamma)T_2}}{1 - e^{-(\lambda(\theta)+\log\gamma)}}\,K_1K_2\,\delta, \qquad (6)$$

where $K_2 > 0$ is the Lipschitz constant obtained from (A.2) and (A.5). If we wish to bound the right-hand side of (6) by a term of order $O(\delta^{p})$ when $\delta \ll 1$, the length $T_2(\delta) \in \mathbb{N}$ should satisfy

$$T_2(\delta) \ \le\ C_1 + \frac{p - 1}{\lambda(\theta) + \log\gamma}\,\log\delta, \qquad (7)$$

where $C_1 > 0$ is a constant independent of $p$ and $\delta$.

Figure 1: An illustration of the two series (7) and (9), which together need to cover the whole time range as $\delta \to 0$.

Next, the sum of the tail terms of $V^{\pi_\theta}(\cdot)$ starting from $T_3 \in \mathbb{N}$ is automatically bounded by

$$\sum_{t=T_3}^{\infty}\gamma^t\,|c(s_t, \pi_\theta(s_t)) - c(s'_t, \pi_\theta(s'_t))| \ \le\ \frac{2M_2\, e^{T_3\log\gamma}}{1-\gamma}, \qquad (8)$$

where $M_2 = \max_{s \in S} c(s, \pi_\theta(s))$ is the maximum of the continuous function $c(\cdot, \pi_\theta(\cdot))$ over the compact domain $S$ (and hence exists). If we bound the right-hand side of (8) by a term of order $O(\delta^{p})$, it yields

$$T_3(\delta) \ \ge\ C_2 + \frac{p}{\log\gamma}\,\log\delta \qquad (9)$$

for some constant $C_2 > 0$ independent of $\delta$. The sum of (6) and (8) provides a good estimate of $V^{\pi_\theta}$ only if $T_3(\delta) - T_2(\delta) \le N_0$ for some $N_0 > 0$ as $\delta \to 0$; otherwise there would be infinitely many terms in the middle that cannot be controlled by any $O(\delta^{p})$ term. In this case, we have

$$(C_2 - C_1) + \Big(\frac{p}{\log\gamma} - \frac{p - 1}{\lambda(\theta) + \log\gamma}\Big)\log\delta \ \le\ N_0 \qquad (10)$$

as $\log\delta \to -\infty$, which implies that the slopes satisfy the inequality

$$\frac{p}{\log\gamma} - \frac{p - 1}{\lambda(\theta) + \log\gamma} \ \ge\ 0, \qquad (11)$$

where equality holds when $p = \frac{-\log\gamma}{\lambda(\theta)}$. Thus, $V^{\pi_\theta}(s)$ is $\frac{-\log\gamma}{\lambda(\theta)}$-Hölder continuous at $s = s_0$.

On the other hand, the following counterexample shows that Theorem 4.1 provides the strongest Hölder continuity result for $V^{\pi_\theta}(s)$ at $s = s_0$ under assumptions (A.1)-(A.5):

**Example 4.1.** Consider a one-dimensional MDP $s_{t+1} = f(s_t, a_t)$ with

$$f(s, a) = \begin{cases} -1, & a \le -1,\\ a, & -1 < a < 1,\\ 1, & a \ge 1,\end{cases} \qquad (12)$$

state space $S = [-1, 1]$ and cost function $c(s, a) = |s|$. Let the policy be linear, namely $\pi_\theta(s) = \theta s$ with $\theta \in \mathbb{R}$. It can be verified that all assumptions (A.1)-(A.5) are satisfied. Now let $s_0 = 0$ and $\theta > 1$; then applying (4) directly yields

$$\lambda(\theta) = \limsup_{t\to\infty}\ \limsup_{\|Z_0\|\to 0}\ \frac{1}{t}\log\frac{\|\theta^t Z_0\|}{\|Z_0\|} = \log\theta.$$

Let $\delta > 0$ be sufficiently small. The trajectory starting from $s'_0 = \delta$ satisfies $s'_t = \theta^t\delta$ until it saturates at $s'_t = 1$, which happens no later than $T_0(\delta) = 1 + \lfloor -\log\delta/\log\theta\rfloor$, where $\lfloor\cdot\rfloor$ denotes the floor function; hence $V^{\pi_\theta}(\delta) = \sum_{t=0}^{\infty}\gamma^t|s'_t| \ge \sum_{t=T_0(\delta)}^{\infty}\gamma^t$.
Therefore, we have
$$|V^{\pi_\theta}(\delta) - V^{\pi_\theta}(0)| = V^{\pi_\theta}(\delta) \ \ge\ \frac{\gamma^{T_0(\delta)}}{1-\gamma} \ \ge\ \frac{\gamma}{1-\gamma}\,\delta^{-\log\gamma/\log\theta},$$
so the exponent $\frac{-\log\gamma}{\lambda(\theta)}$ in Theorem 4.1 is attained.

**Remark 4.1.** Another way to see why it is theoretically impossible to prove $p$-Hölder continuity of $V^{\pi_\theta}$ for any $p > \frac{-\log\gamma}{\lambda(\theta)}$: the inequality (10) no longer holds as $\log\delta \to -\infty$, since then $\frac{p}{\log\gamma} - \frac{p-1}{\lambda(\theta)+\log\gamma} < 0$. Thus $p = \frac{-\log\gamma}{\lambda(\theta)}$ is the largest Hölder exponent of $V^{\pi_\theta}(\cdot)$ that can be established in the worst case.

**Remark 4.2.** The value function $V^{\pi_\theta}(s)$ is Lipschitz continuous at $s = s_0$ when the maximal Lyapunov exponent satisfies $\lambda(\theta) < -\log\gamma$, since then
$$|V^{\pi_\theta}(s'_0) - V^{\pi_\theta}(s_0)| \le \sum_{t=0}^{\infty}\gamma^t\,|c(s_t,\pi_\theta(s_t)) - c(s'_t,\pi_\theta(s'_t))| \le \sum_{t=0}^{\infty}\gamma^t K\,\delta\, e^{\lambda(\theta)t} = K\,\delta\sum_{t=0}^{\infty} e^{(\lambda(\theta)+\log\gamma)t} \le \frac{K\,\delta}{1 - e^{\lambda(\theta)+\log\gamma}},$$
where $\delta = \|s_0 - s'_0\| > 0$ is the difference in the initial state.

### 4.2 Hölder Exponent of $J(\cdot)$

The following lemma establishes a direct connection between $J(\theta)$ and $J(\theta')$ through value functions:

**Lemma 4.1.** Suppose that $\theta, \theta' \in \mathbb{R}^N$; then
$$V^{\pi_{\theta'}}(s_0) - V^{\pi_\theta}(s_0) = \sum_{t=0}^{\infty}\gamma^t\Big(Q^{\pi_\theta}\big(s^{\theta'}_t, \pi_{\theta'}(s^{\theta'}_t)\big) - V^{\pi_\theta}\big(s^{\theta'}_t\big)\Big),$$
where $\{s^{\theta'}_t\}_{t=0}^{\infty}$ is the trajectory generated by the policy $\pi_{\theta'}(\cdot)$.

The proof can be found in Appendix B.2. Noting that $J(\theta') = V^{\pi_{\theta'}}(s_0)$ and $J(\theta) = V^{\pi_\theta}(s_0)$, substituting these two terms into the lemma and performing some calculations leads to the following main theorem, whose proof can be found in Appendix B.3:

**Theorem 4.2.** (Non-smoothness of $J$) Assume (A.1)-(A.5) and that the parameterized policy $\pi_\theta(\cdot)$ is deterministic. Let $\lambda(\theta)$ denote the MLE of (1) under $\pi_\theta$ at $\theta \in \mathbb{R}^N$. Suppose that $\lambda(\theta) > -\log\gamma$; then $J(\cdot)$ is $\frac{-\log\gamma}{\lambda(\theta)}$-Hölder continuous at $\theta$.

**Remark 4.3.** The set of assumptions (A.1)-(A.5) is quite general; in particular, it does not exclude constant cost functions $c(s,a) \equiv \text{const}$, which always yield a smooth landscape regardless of the underlying dynamics, even though such costs are rarely used in practice. Recall, however, that the $\frac{-\log\gamma}{\lambda(\theta)}$-Hölder continuity is a consequence of the exponential divergence of nearby trajectories: when the cost function continuously distinguishes two separating trajectories (e.g., quadratic costs) and the discount factor is close to 1, the landscape becomes fractal, as shown in Section 6. Another way to see this is the relaxation in (5), where Hölder continuity is obtained from the local Lipschitz continuity of $c(s,a)$, i.e., $|c(s,\pi_\theta(s)) - c(s',\pi_\theta(s'))| \le K_2\|s - s'\|$. The Hölder exponent is therefore tight if for any $\delta > 0$ there exists $s'_0 \in B(s_0, \delta)$ such that $|c(s_t,\pi_\theta(s_t)) - c(s'_t,\pi_\theta(s'_t))| \ge K_3\|s_t - s'_t\|$ for some $K_3 > 0$ and all $t \in \mathbb{N}$. We leave further investigation to future studies.

The following example illustrates how the smoothness of the loss landscape changes with $\lambda(\theta)$ and $\gamma$:

**Example 4.2.** (Logistic model) Consider the MDP
$$s_{t+1} = (1 - s_t)\,a_t, \qquad s_0 = 0.9, \qquad (13)$$
where the action is given by the deterministic linear policy $a_t = \pi_\theta(s_t) = \theta s_t$. The objective function is defined as $J(\theta) = \sum_{t=0}^{\infty}\gamma^t\,(s_t^2 + 0.1\,a_t^2)$, where $\gamma \in (0,1)$ is the discount factor. It is well known that (13) begins to exhibit chaotic behavior with positive MLEs (as shown in Figure 2a) when $\theta \gtrsim 3.3$ [15], so we plot the graphs of $J(\theta)$ for different discount factors over the interval $\theta \in [3.3, 3.9]$. From Figure 2b to 2d, the non-smoothness becomes more and more significant as $\gamma$ grows. In particular, Figure 2e shows that the value of $J(\theta)$ fluctuates violently even within a very small interval of $\theta$, suggesting a high degree of non-differentiability in this region.

Figure 2: Panel (a) shows the MLE $\lambda(\theta)$ for $\theta \in [3.3, 3.9]$. Panels (b)-(d) show the objective $J(\theta)$ for $\gamma = 0.5$, $\gamma = 0.9$ and $\gamma = 0.99$, respectively, and panel (e) is a magnified view; $J(\theta)$ is estimated by the sum of the first 1000 terms of the infinite series.
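A minimal numerical sketch of Example 4.2 follows. For a one-dimensional smooth map, the MLE of Definition 3.1 reduces to the time average of $\log|F'(s_t)|$ along the trajectory, and $J(\theta)$ is approximated by its first 1000 terms as in Figure 2; the grid of $\theta$ values and the small guard constant inside the logarithm are illustrative choices.

```python
import numpy as np

def mle_logistic(theta, s0=0.9, T=2000):
    """MLE of the closed-loop map s -> theta*s*(1-s) (Example 4.2 with a = theta*s):
    time average of log|d s_{t+1} / d s_t| = log|theta*(1 - 2*s_t)|."""
    s, acc = s0, 0.0
    for _ in range(T):
        acc += np.log(abs(theta * (1.0 - 2.0 * s)) + 1e-12)  # guard against s = 0.5
        s = theta * s * (1.0 - s)
    return acc / T

def J_logistic(theta, gamma, s0=0.9, T=1000):
    """Truncated objective J(theta) = sum_t gamma^t (s_t^2 + 0.1 a_t^2), a_t = theta*s_t."""
    s, total = s0, 0.0
    for t in range(T):
        a = theta * s
        total += gamma**t * (s * s + 0.1 * a * a)
        s = (1.0 - s) * a          # the dynamics (13)
    return total

thetas = np.linspace(3.3, 3.9, 1000)
lam = np.array([mle_logistic(th) for th in thetas])                  # Figure 2a
J_curves = {g: np.array([J_logistic(th, g) for th in thetas])        # Figures 2b-2d
            for g in (0.5, 0.9, 0.99)}
```

Plotting `J_curves[g]` against `thetas` for each discount factor reproduces the qualitative picture of Figure 2: the curve is smooth for small $\gamma$ and becomes increasingly jagged as $\gamma \to 1$ wherever `lam` is positive.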
### 4.3 Stochastic Policies

The original MDP (1) becomes stochastic when a stochastic policy is employed. First, consider the following slightly modified version of the MLE for stochastic policies:
$$\lambda_{\max} = \limsup_{t\to\infty}\ \limsup_{\|Z_0\|\to 0}\ \frac{1}{t}\log\frac{\mathbb{E}_\pi\big[\|Z_\omega(t)\|\big]}{\|Z_0\|}, \qquad (14)$$
where $Z_0 = s'_0 - s_0$ is a small perturbation of the initial state and $Z_\omega(t) = s'_t(\omega) - s_t(\omega)$ is the difference between the sample paths at time $t \in \mathbb{N}$ for sample $\omega \in \Omega$. Since this definition is consistent with (4) when the variance is sent to 0, we use the same notation $\lambda(\theta)$ to denote the MLE at a given $\theta \in \mathbb{R}^N$ and again assume $\lambda(\theta) > -\log\gamma$.

Since the optimal policies in most control and robotics environments are deterministic, training tends to drive the variance of a stochastic policy toward 0. However, unlike the deterministic case, where the Hölder continuity result was proved under the assumption that the policy $\pi_\theta(s)$ is locally Lipschitz continuous, a stochastic policy instead provides a probability distribution from which the action is sampled, and it cannot remain locally Lipschitz continuous in $\theta$ as it approaches its deterministic limit. For instance, consider the one-dimensional Gaussian policy $\pi_\theta(a|s)$ with $\theta = [\mu, \sigma]^T$ denoting the mean and standard deviation. As $\sigma$ approaches 0, $\pi_\theta(a|s)$ becomes more and more concentrated at $a = \mu s$ and eventually converges to the Dirac delta function $\delta(a - \mu s)$, which means that $\pi_\theta(a|s)$ cannot be Lipschitz continuous within any neighborhood of $\theta = [\mu, 0]^T$, even though its deterministic limit $\pi_\theta(s) = \mu s$ is Lipschitz continuous. The following example illustrates that in this case the Hölder exponent of the objective function $J(\cdot)$ can still be less than 1:

**Example 4.3.** Consider the one-dimensional MDP $s_{t+1} = f(s_t, a_t)$ with $f(s,a)$ defined as in (12), state space $S = [-1, 1]$ and action space $A = [0, \infty)$. The cost function is $c(s, a) = s + 1$. The parameter is $\theta = [\theta_1, \theta_2]^T \in \mathbb{R}^2$ and the policy $\pi_\theta(\cdot|s) \sim U\big(|\theta_1|s + |\theta_2|,\ |\theta_1|s + 2|\theta_2|\big)$ is a uniform distribution. It is easy to verify that all required assumptions are satisfied. Let the initial state be $s_0 = 0$ and take $\theta_1 > 1$, $\theta_2 = 0$; then applying (14) directly yields $\lambda(\theta) = \log\theta_1$, as in Example 4.1. Now suppose that $\theta'_2 > 0$ is small and $\theta' = [\theta_1, \theta'_2]^T$. For any $\omega \in \Omega$ in the sample space, the sampled trajectory $\{s'_t\}$ generated by $\pi_{\theta'}$ satisfies
$$s'_{t+1}(\omega) \ \ge\ \theta_1 s'_t(\omega) + \theta'_2 \ >\ \theta_1 s'_t(\omega) \ \ge\ \cdots \ \ge\ \theta_1^{t}\, s'_1(\omega) \ \ge\ \theta_1^{t}\,\theta'_2$$
as long as $s'_{t+1}(\omega) < 1$. Thus $s'_t(\omega) = 1$ for all $\omega \in \Omega$ and $t \ge T_0(\theta') = 1 + \lfloor -\log\theta'_2/\log\theta_1\rfloor$, which further leads to
$$J(\theta') = \frac{1}{1-\gamma} + \sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{\pi_{\theta'}}[s'_t] \ \ge\ J(\theta) + \sum_{t=T_0(\theta')}^{\infty}\gamma^t\,\mathbb{E}_{\pi_{\theta'}}[s'_t] \ \ge\ J(\theta) + \frac{\gamma}{1-\gamma}\,(\theta'_2)^{-\log\gamma/\log\theta_1},$$
using the fact that $J(\theta) = \frac{1}{1-\gamma}$. Plugging $\|\theta' - \theta\| = \theta'_2$ into the above inequality yields
$$J(\theta') - J(\theta) \ \ge\ \frac{\gamma}{1-\gamma}\,\|\theta' - \theta\|^{-\log\gamma/\log\theta_1}, \qquad (15)$$
where the Hölder exponent is again $\frac{-\log\gamma}{\lambda(\theta)}$, as in Example 4.1.

**Remark 4.4.** Consider the 1-Wasserstein distance, as defined in [36], between the deterministic limit of the policy (the point mass at $|\theta_1|s$, corresponding to $\theta_2 = 0$) and $U(|\theta_1|s + |\theta_2|,\ |\theta_1|s + 2|\theta_2|)$, which is given by $W_1(\theta_1, \theta_2) = \frac{3|\theta_2|}{2}$. It is Lipschitz continuous at $\theta_2 = 0$, even though the non-smoothness result (15) holds. Therefore, probability-distribution metrics such as the Wasserstein distance are too "coarse" to capture the full fractal nature of the objective function.
This also suggests that further assumptions on the pointwise smoothness of the probability density functions are necessary to obtain a smooth landscape with stochastic policies, even though such assumptions may exclude the case $\sigma \to 0$ discussed earlier.

## 5 Estimating Hölder Exponents from Samples

In the previous sections we have seen that the objective function $J(\theta)$ can be highly non-smooth, so gradient-based methods may not work well in the policy parameter space. The question is: how can we determine whether the objective function $J(\theta)$ is differentiable at some $\theta = \theta_0$ in high-dimensional settings? Note that $J(\theta)$ may have different levels of smoothness along different directions. To address this, we propose a statistical method for estimating the Hölder exponent. Consider the objective function $J(\theta)$ and the isotropic Gaussian distribution $X \sim N(\theta_0, \sigma^2 I_N)$, where $I_N$ is the $N \times N$ identity matrix. For a continuous objective function $J(\cdot)$, the variance of $J(X)$ can be expressed as
$$\mathrm{Var}(J(X)) = \mathbb{E}_{X\sim N(\theta_0,\sigma^2 I_N)}\Big[\big(J(X) - \mathbb{E}_{X\sim N(\theta_0,\sigma^2 I_N)}[J(X)]\big)^2\Big] = \mathbb{E}_{X\sim N(\theta_0,\sigma^2 I_N)}\big[(J(X) - J(\xi^*))^2\big],$$
where $\xi^* \in \mathbb{R}^N$ is obtained by applying the intermediate value theorem to $\mathbb{E}_{X\sim N(\theta_0,\sigma^2 I_N)}[J(X)]$ and hence is not a random variable. If $J(\cdot)$ is locally Lipschitz continuous at $\theta_0$, say $|J(\theta) - J(\theta_0)| \le K\|\theta - \theta_0\|$ for some $K > 0$ when $\|\theta - \theta_0\|$ is small, then we have the approximation
$$\mathrm{Var}(J(X)) \ \lesssim\ K^2\,\mathbb{E}_{X\sim N(\theta_0,\sigma^2 I_N)}\big[\|X - \xi^*\|^2\big] \ \sim\ O(\sigma^2) \qquad (16)$$
when $\sigma \ll 1$. Therefore, (16) provides a way to directly determine whether the Hölder exponent of $J(\cdot)$ at a given $\theta \in \mathbb{R}^N$ is less than 1, especially when the dimension $N$ is large. In particular, taking the logarithm on both sides of (16) yields
$$\log \mathrm{Var}_\sigma(J(X)) \ \lesssim\ C + 2\log\sigma \qquad (17)$$
for some constant $C$, where the subscript in $\mathrm{Var}_\sigma(J(X))$ indicates its dependence on the standard deviation $\sigma$ of $X$. Thus the log-log plot of $\mathrm{Var}_\sigma(J(X))$ versus $\sigma$ is expected to be close to a straight line with slope $k \approx 2$ when $J(\theta)$ is locally Lipschitz continuous around $\theta = \theta_0$; more generally, if $J$ is α-Hölder continuous at $\theta_0$, the same argument suggests a slope of roughly $2\alpha$. Therefore, one can assess smoothness by sampling around $\theta_0$ with different variances and estimating the slope via linear regression. Usually, $J(\theta)$ is Lipschitz continuous at $\theta = \theta_0$ when the slope $k$ is close to or greater than 2, and it is non-differentiable if the slope is less than 2.

## 6 Experiments

In this section, we validate the theory presented in this paper on common RL tasks. All environments are adapted from OpenAI Gym [5] with continuous control inputs. The experiments are conducted in two steps: first, we randomly sample a parameter $\theta_0$ from a Gaussian distribution and estimate the gradient $\eta(\theta_0)$ from (3); second, we evaluate $J(\theta)$ at $\theta = \theta_0 + \delta\,\eta(\theta_0)$ for a range of small $\delta > 0$. According to our results, the loss curve is expected to become smoother as $\gamma$ decreases, since a smaller $\gamma$ makes the Hölder exponent $\frac{-\log\gamma}{\lambda(\theta)}$ larger. At the same time, the policy gradient method (3) should give a better descent direction as the true objective function $J(\cdot)$ becomes smoother. Since a single sample path can be non-smooth whenever the policy is stochastic, which would interfere with the observation we are after, we use stochastic policies to estimate the gradient in (3) and apply their deterministic version (by setting the variance to 0) when evaluating $J(\theta)$. Regarding the infinite series, we use the sum of the first 1000 terms to approximate $J(\theta)$.
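The Hölder-exponent estimates reported below (Figures 3e, 4e and 5e) follow the slope-regression procedure of Section 5. A minimal sketch, assuming only a callable objective `J` (e.g., the truncated rollout cost); the specific grid of standard deviations and the sample count are illustrative choices:

```python
import numpy as np

def holder_slope(J, theta0, sigmas=(1e-4, 3e-4, 1e-3, 3e-3, 1e-2),
                 samples=256, seed=0):
    """Fit the slope k in log Var_sigma(J(X)) ~ C + k log sigma with
    X ~ N(theta0, sigma^2 I), as in (16)-(17).  A slope close to 2 is consistent
    with local Lipschitz continuity at theta0; k < 2 suggests a Hoelder exponent
    of roughly k/2 < 1, i.e. a non-differentiable (possibly fractal) region."""
    rng = np.random.default_rng(seed)
    log_sigma, log_var = [], []
    for sigma in sigmas:
        X = theta0 + sigma * rng.standard_normal((samples, theta0.size))
        values = np.array([J(x) for x in X])
        log_sigma.append(np.log(sigma))
        log_var.append(np.log(values.var()))
    slope, _ = np.polyfit(log_sigma, log_var, 1)
    return slope
```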
The stochastic policy is given by $\pi_\theta(\cdot|s) \sim N(u(s), \sigma^2 I)$, where the mean $u(s)$ is represented by the two-layer neural network $u(s) = W_2\tanh(W_1 s)$ with weight matrices $W_1 \in M_{r\times n}(\mathbb{R})$ and $W_2 \in M_{m\times r}(\mathbb{R})$. Let $\theta = [W_1, W_2]^T$ denote the vectorized policy parameters. For the width of the hidden layer, we use $r = 8$ for the inverted pendulum and the acrobot, and $r = 64$ for the hopper.

Figure 3: Experimental results for the inverted pendulum: (a) $\gamma = 0.9$, deterministic; (b) $\gamma = 0.9$, stochastic; (c) $\gamma = 0.99$, deterministic; (d) $\gamma = 0.99$, stochastic; (e) linear regression $y = 1.980x + 2.088$, obtained for $\gamma = 0.9$. The loss curves $J(\theta)$ in 3a-3d are evaluated at $\theta = \theta_0 + \delta\,\eta(\theta_0)$ with step size $10^{-7}$.

**Inverted pendulum.** The inverted pendulum task is a standard test case for RL algorithms, and here we use it as an example of a non-chaotic system. The initial state is always taken to be $s_0 = [-1, 0]^T$ ($[0,0]^T$ is the upright position), and we use the quadratic cost $c(s_t, a_t) = s_t^T Q s_t + 0.001\|a_t\|^2$, where $Q = \mathrm{diag}(1, 0.1)$ is a $2\times 2$ diagonal matrix, $s_t \in \mathbb{R}^2$ and $a_t \in \mathbb{R}$. The initial parameter is sampled as $\theta_0 \sim N(0, 0.05^2 I)$. In Figures 3a and 3c, the loss curve is close to a straight line within a very small interval, which indicates local smoothness at $\theta_0$. This is corroborated by the estimate of the Hölder exponent of $J(\theta)$ at $\theta = \theta_0$ based on (16), obtained by sampling parameters around $\theta_0$ with different variances: in Figure 3e the slope $k = 1.980$ is very close to 2, so Lipschitz continuity (and hence differentiability) is confirmed at $\theta = \theta_0$. By comparison, the loss curve of a single random sample path is completely non-smooth, as shown in Figures 3b and 3d.

**Acrobot.** The acrobot system is well known for its chaotic behavior, and hence we use it as the main test case. Here we use the cost function $c(s_t, a_t) = s_t^T Q s_t + 0.005\|a_t\|^2$, where $Q = \mathrm{diag}(1, 1, 0.1, 0.1)$, $s_t \in \mathbb{R}^4$ and $a_t \in \mathbb{R}$. The initial state is $s_0 = [1, 0, 0, 0]^T$, and the initial parameter is again sampled from $\theta_0 \sim N(0, 0.05^2 I)$. From Figures 4a-4c, the non-smoothness grows as $\gamma$ increases, and the landscape finally becomes completely non-differentiable at $\gamma = 0.99$, the most commonly used value of the discount factor. This partially explains why the acrobot task is difficult for policy gradient methods. In Figure 4e, the Hölder exponent of $J(\theta)$ at $\theta = \theta_0$ is estimated as $\alpha \approx 0.43155 < 1$, which further indicates non-differentiability around $\theta_0$.

Figure 4: Experimental results for the acrobot: (a) $\gamma = 0.8$, deterministic; (b) $\gamma = 0.9$, deterministic; (c) $\gamma = 0.99$, deterministic; (d) $\gamma = 0.99$, stochastic; (e) linear regression $y = 0.8631x + 2.041$, obtained for $\gamma = 0.9$. The loss curves $J(\theta)$ in 4a-4d are evaluated at $\theta = \theta_0 + \delta\,\eta(\theta_0)$ with step size $10^{-7}$.

**Hopper.** Finally, we consider the hopper task, in which the cost function is defined as $c(s, a) = (1.25 - s[0]) + 0.001\|a\|^2$, where $s[0]$, the first coordinate of $s \in \mathbb{R}^{11}$, is the height of the hopper. Because the number of parameters in the neural network is larger, the initial parameter is instead sampled from $\theta_0 \sim N(0, 10^2 I)$. As seen in Figure 5a, the loss curve is almost a straight line when $\gamma = 0.8$; it starts to exhibit non-smoothness at $\gamma = 0.9$ and becomes completely non-differentiable at $\gamma = 0.99$. Supporting evidence from the Hölder-exponent estimate is provided in Figure 5e, where the slope is far less than 2.

Figure 5: Experimental results for the hopper: (a) $\gamma = 0.8$, deterministic; (b) $\gamma = 0.9$, deterministic; (c) $\gamma = 0.99$, deterministic; (d) $\gamma = 0.99$, stochastic; (e) linear regression $y = 0.5036x - 1.250$, obtained for $\gamma = 0.9$. The loss curves $J(\theta)$ in 5a-5d are evaluated at $\theta = \theta_0 + \delta\,\eta(\theta_0)$ with step size $10^{-3}$.
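For completeness, a sketch of the evaluation loop behind the loss curves in Figures 3-5: a deterministic two-layer tanh policy is rolled out for 1000 steps and the truncated discounted cost (2) is recorded along the estimated gradient direction. The `env_step` and `cost` callables, the state and parameter shapes, and the reshaping convention are placeholders for the task at hand.

```python
import numpy as np

def make_tanh_policy(theta, n, r, m):
    """Deterministic mean policy u(s) = W2 @ tanh(W1 @ s); theta packs (W1, W2)."""
    W1 = theta[: r * n].reshape(r, n)
    W2 = theta[r * n:].reshape(m, r)
    return lambda s: W2 @ np.tanh(W1 @ s)

def truncated_cost(env_step, cost, policy, s0, gamma=0.99, T=1000):
    """First 1000 terms of (2) along a deterministic rollout."""
    s, total = np.array(s0, dtype=float), 0.0
    for t in range(T):
        a = policy(s)
        total += gamma**t * cost(s, a)
        s = env_step(s, a)
    return total

def loss_curve(env_step, cost, theta0, direction, s0, n, r, m,
               step=1e-7, n_points=200, gamma=0.99):
    """J(theta0 + i*step*direction) for i = 0, ..., n_points-1 (Section 6 protocol)."""
    return [truncated_cost(env_step, cost,
                           make_tanh_policy(theta0 + i * step * direction, n, r, m),
                           s0, gamma)
            for i in range(n_points)]
```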
## 7 Conclusion

In this paper, we initiate the study of chaotic behavior in reinforcement learning, focusing on how it is reflected in the fractal landscape of the objective function. We propose a method to statistically estimate the Hölder exponent at a given parameter, so that one can tell whether the training process has encountered a fractal landscape. We believe that the theory established in this paper can help explain many existing observations in reinforcement learning, such as the hardness of complex control tasks and the fluctuating behavior of training curves. It also poses a serious question about the well-posedness of policy gradient methods, given that no gradient exists in many continuous state-space RL problems. Since the non-smoothness of the loss landscape is an intrinsic property of the model, rather than a consequence of numerical or statistical error, we conjecture that the framework developed in this paper may provide new insights into the limitations of a wider range of deep learning problems beyond reinforcement learning.

## 8 Acknowledgements

Our work is supported by NSF Career CCF 2047034, NSF AI Institute CCF 2112665, ONR YIP N00014-22-1-2292, NSF CCF DASS 2217723, and an Amazon Research Award. The authors thank Zichen He, Bochao Kong and Xie Wu for insightful discussions.

## References

[1] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1-76, 2021.

[2] A. Bagirov, N. Karmitsa, and M. M. Mäkelä. Introduction to Nonsmooth Optimization. Springer, 2014.

[3] M. Barnsley. Fractals Everywhere. Academic Press, 1988.

[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[6] A. Camuto, G. Deligiannidis, M. A. Erdogdu, M. Gürbüzbalaban, U. Simsekli, and L. Zhu. Fractal structure and generalization properties of stochastic optimization algorithms. arXiv preprint arXiv:2106.04881, 2021.

[7] F. H. Clarke. Methods of Dynamic and Nonsmooth Optimization. SIAM, 1989.

[8] K. J. Falconer. Fractal Geometry: Mathematical Foundations and Applications. John Wiley, 1990.

[9] J. Fan, Z. Wang, Y. Xie, and Z. Yang. A theoretical analysis of deep Q-learning. Volume 120, pages 486-489, 2020.

[10] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of the 35th International Conference on Machine Learning, pages 1467-1476, 2018.

[11] X. Guo and B. Hu. Convergence of direct policy search for state-feedback $H_\infty$ robust control: A revisit of nonsmooth synthesis with Goldstein subdifferential. arXiv preprint arXiv:2210.11577, 2022.

[12] G. H. Hardy. Weierstrass's non-differentiable function. Transactions of the American Mathematical Society, 17:301-325, 1916.

[13] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, G. Dulac-Arnold, J. Agapiou, J. Leibo, and A. Gruslys. Deep Q-learning from demonstrations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
[14] M. W. Hirsch, S. Smale, and R. L. Devaney. Differential Equations, Dynamical Systems, and an Introduction to Chaos. Academic Press, 2013.

[15] D. W. Jordan and P. Smith. Nonlinear Ordinary Differential Equations: An Introduction for Scientists and Engineers. Oxford University Press, 2007.

[16] S. M. Kakade. A natural policy gradient. Advances in Neural Information Processing Systems 14, pages 1531-1538, 2001.

[17] H. Kantz. A robust method to estimate the maximal Lyapunov exponent of a time series. Physics Letters A, 185:77-87, 1994.

[18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[19] E. N. Lorenz. The Essence of Chaos. University of Washington Press, 1995.

[20] S. W. McDonald, C. Grebogi, E. Ott, and J. A. Yorke. Fractal basin boundaries. Physica D, 17:125-153, 1985.

[21] L. Metz, C. D. Freeman, S. S. Schoenholz, and T. Kachman. Gradients are not all you need. 2021.

[22] J. Millán, D. Posenato, and E. Dedieu. Continuous-action Q-learning. Machine Learning, 49:247-265, 2002.

[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518:529-533, 2015.

[24] P. Parmas, C. E. Rasmussen, J. Peters, and K. Doya. PIPPS: Flexible model-based policy search robust to the curse of chaos. Proceedings of the 35th International Conference on Machine Learning, pages 4065-4074, 2018.

[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning, pages 1310-1318, 2013.

[26] D. Preiss. Geometry of measures in $\mathbb{R}^n$: Distribution, rectifiability, and densities. Annals of Mathematics, 125(3):537-643, 1987.

[27] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning, pages 1889-1897, 2015.

[28] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[29] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. Proceedings of the 31st International Conference on Machine Learning, pages 387-395, 2014.

[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354-359, 2017.

[31] H. Suh, M. Simchowitz, K. Zhang, and R. Tedrake. Do differentiable simulators give better policy gradients? Proceedings of the 39th International Conference on Machine Learning, 162:20668-20696, 2022.

[32] R. S. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[33] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12, pages 1057-1063, 1999.

[34] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, pages 2139-2148, 2016.
[35] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems 9, pages 1075-1081, 1996.

[36] S. S. Vallender. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability and its Applications, 18(4):784-786, 1974.

[37] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), 2016.

[38] O. Vinyals, I. Babuschkin, W. Czarnecki, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350-354, 2019.

[39] R. Wang, D. P. Foster, and S. M. Kakade. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.

[40] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.

[41] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

[42] L. Xiao. On the convergence rates of policy gradient methods. arXiv preprint arXiv:2201.07443, 2022.

[43] K. Zhang, B. Hu, and T. Başar. Policy optimization for $H_2$ linear control with $H_\infty$ robustness guarantee: Implicit regularization and global convergence. SIAM Journal on Control and Optimization, 59(6):4081-4109, 2021.

## A A Brief Introduction to Chaos Theory

As mentioned in the introduction, chaos exists in many real-world systems. Although no universal definition of chaos exists, there are three features that a chaotic system usually possesses [14]:

- dense periodic points;
- topological transitivity;
- sensitive dependence on initial conditions.

In some cases, some of these properties imply the others. It is important to note that, although the appearance of chaos is always accompanied by high unpredictability, chaotic behavior is entirely deterministic and is not a consequence of randomness. Another interesting fact is that the trajectories of a chaotic system are usually bounded, which prompts us to think about the convergence of policy gradient methods beyond the boundedness of state spaces.

Figure 6: The Lorenz attractor (a) and the Rössler attractor (b) are standard examples of chaotic systems, in which a small perturbation of the initial state can result in a significant divergence of the entire trajectory.

It can be summarized from the results in this paper that, for a given MDP, the following three features contribute most to its chaotic behavior:

- infinite time horizon ($t \to \infty$);
- continuous state space ($\|Z_0\| \to 0$);
- exponential divergence ($\lambda_{\max} > 0$).

Since these features are not necessarily bound to particular types of continuous state-space MDPs, it would be exciting for future studies to investigate other types of MDPs using the framework developed in this paper.

## B Proofs omitted in Section 4

### B.1 Proof of Theorem 4.1

*Proof.* Suppose that $s'_0 \in S$ is another initial state close to $s_0$ and let $\delta = \|s_0 - s'_0\|$.
Let $T_1 \in \mathbb{N}$ be the smallest integer that satisfies
$$T_1 \ \ge\ \frac{1}{\lambda(\theta)}\log\Big(\frac{2M_2}{K_1\delta}\Big), \qquad (18)$$
where $M_2 = 1 + \max_{s\in S} c(s, \pi_\theta(s)) > 0$ bounds the continuous function $c(\cdot, \pi_\theta(\cdot))$ over $S$. Applying the Lipschitz condition on $c(\cdot, \pi_\theta(\cdot))$ yields
$$\sum_{t=0}^{T_1}\gamma^t\,|c(s_t,\pi_\theta(s_t)) - c(s'_t,\pi_\theta(s'_t))| \ \le\ \sum_{t=0}^{T_1}\gamma^t K_2\|s_t - s'_t\| \ \le\ \sum_{t=0}^{T_1} K_1K_2\,e^{(\lambda(\theta)+\log\gamma)t}\,\delta \ \le\ \frac{e^{2(\lambda(\theta)+\log\gamma)}\,K_2\,K_1^{-\log\gamma/\lambda(\theta)}\,(2M_2)^{1+\log\gamma/\lambda(\theta)}}{e^{\lambda(\theta)+\log\gamma} - 1}\ \delta^{-\log\gamma/\lambda(\theta)},$$
where $K_2 > 0$ is the Lipschitz constant of $c(\cdot, \pi_\theta(\cdot))$ over the compact set $S$, and the last step uses that $T_1$ exceeds the right-hand side of (18) by at most 1. On the other hand, the tail terms of $J(\cdot)$ are bounded by
$$\sum_{t=T_1+1}^{\infty}\gamma^t\,|c(s_t,\pi_\theta(s_t)) - c(s'_t,\pi_\theta(s'_t))| \ \le\ \sum_{t=T_1+1}^{\infty} 2M_2\,\gamma^t \ =\ \frac{2M_2\,\gamma^{T_1+1}}{1-\gamma} \ \le\ \frac{2M_2\,\gamma}{1-\gamma}\Big(\frac{K_1}{2M_2}\Big)^{-\log\gamma/\lambda(\theta)}\delta^{-\log\gamma/\lambda(\theta)},$$
using $|c(s_t,\pi_\theta(s_t)) - c(s'_t,\pi_\theta(s'_t))| \le 2M_2$ and (18). Combining the two inequalities yields
$$|V^{\pi_\theta}(s'_0) - V^{\pi_\theta}(s_0)| \ \le\ \sum_{t=0}^{\infty}\gamma^t\,|c(s_t,\pi_\theta(s_t)) - c(s'_t,\pi_\theta(s'_t))| \ \le\ C\,\delta^{-\log\gamma/\lambda(\theta)}$$
for a constant $C > 0$ depending only on $K_1, K_2, M_2, \gamma$ and $\lambda(\theta)$, which completes the proof.

### B.2 Proof of Lemma 4.1

*Proof.* For ease of notation, let $s'_t = s^{\theta'}_t$, $u(s) = \pi_\theta(s)$ and $u'(s) = \pi_{\theta'}(s)$. Then
$$V^{\pi_{\theta'}}(s_0) - V^{\pi_\theta}(s_0) = \sum_{t=0}^{\infty}\gamma^t c(s'_t, u'(s'_t)) - V^{\pi_\theta}(s_0) = \sum_{t=0}^{\infty}\gamma^t\big(c(s'_t, u'(s'_t)) + V^{\pi_\theta}(s'_t) - V^{\pi_\theta}(s'_t)\big) - V^{\pi_\theta}(s_0)$$
$$= \sum_{t=0}^{\infty}\gamma^t\big(c(s'_t, u'(s'_t)) + \gamma V^{\pi_\theta}(s'_{t+1}) - V^{\pi_\theta}(s'_t)\big) + \sum_{t=0}^{\infty}\gamma^t\big(V^{\pi_\theta}(s'_t) - \gamma V^{\pi_\theta}(s'_{t+1})\big) - V^{\pi_\theta}(s_0)$$
$$= \sum_{t=0}^{\infty}\gamma^t\big(Q^{\pi_\theta}(s'_t, u'(s'_t)) - V^{\pi_\theta}(s'_t)\big) + \sum_{t=0}^{\infty}\gamma^t\big(V^{\pi_\theta}(s'_t) - \gamma V^{\pi_\theta}(s'_{t+1})\big) - V^{\pi_\theta}(s_0).$$
Using the fact that $\gamma^t V^{\pi_\theta}(s'_{t+1}) \to 0$ as $t \to \infty$, which follows from (A.3) and the boundedness of $V^{\pi_\theta}$ over $S$, the second series telescopes to $V^{\pi_\theta}(s'_0)$, so that
$$V^{\pi_{\theta'}}(s_0) - V^{\pi_\theta}(s_0) = \sum_{t=0}^{\infty}\gamma^t\big(Q^{\pi_\theta}(s'_t, u'(s'_t)) - V^{\pi_\theta}(s'_t)\big) + V^{\pi_\theta}(s'_0) - V^{\pi_\theta}(s_0) = \sum_{t=0}^{\infty}\gamma^t\big(Q^{\pi_\theta}(s'_t, u'(s'_t)) - V^{\pi_\theta}(s'_t)\big),$$
where the last step uses $s'_0 = s_0$, and the proof is complete.

### B.3 Proof of Theorem 4.2

*Proof.* First, we show that $Q^{\pi_\theta}(s, a)$ is $\frac{-\log\gamma}{\lambda(\theta)}$-Hölder continuous with respect to $a$. For any $a \in A$ and any $a' \in A$ with $\|a - a'\| \le 1$,
$$|Q^{\pi_\theta}(s,a) - Q^{\pi_\theta}(s,a')| \le |c(s,a) - c(s,a')| + \gamma\,|V^{\pi_\theta}(f(s,a)) - V^{\pi_\theta}(f(s,a'))| \le K_1\|a - a'\| + \gamma K_2\,\|f(s,a) - f(s,a')\|^{-\log\gamma/\lambda(\theta)} \le K_3\,\|a - a'\|^{-\log\gamma/\lambda(\theta)}$$
for some $K_3 > 0$, using the local Lipschitz continuity of $c$ and $f$ together with Theorem 4.1. Noting that $V^{\pi_\theta}(s) = Q^{\pi_\theta}(s, \pi_\theta(s))$ and combining this with Lemma 4.1 yields
$$|J(\theta') - J(\theta)| \le \sum_{t=0}^{\infty}\gamma^t\,\big|Q^{\pi_\theta}(s'_t, \pi_{\theta'}(s'_t)) - Q^{\pi_\theta}(s'_t, \pi_\theta(s'_t))\big| \le \sum_{t=0}^{\infty}\gamma^t K_3\,\|\pi_{\theta'}(s'_t) - \pi_\theta(s'_t)\|^{-\log\gamma/\lambda(\theta)} \le \frac{K_3\,K_4^{-\log\gamma/\lambda(\theta)}}{1-\gamma}\,\|\theta' - \theta\|^{-\log\gamma/\lambda(\theta)},$$
using the fact that $\pi_\theta(s)$ is Lipschitz continuous in $\theta$ (with constant $K_4 > 0$) within a neighborhood of $(\theta, s) \in \mathbb{R}^N \times S$, uniformly over the compact set $S$, and the proof is complete.

## C From the perspective of fractal theory

We now go through some basic concepts of fractal theory that are related to the study of non-smooth functions.

### C.1 The Hausdorff dimension

We first show that the Hausdorff dimension is well defined. When $\delta < 1$, $H^s_\delta(F)$ is non-increasing with respect to $s$, and hence so is $H^s(F)$. Let $s \ge 0$ be such that $H^s(F) < \infty$; then for any $t > s$ and any δ-cover $\{U_i\}$ of $F$ we have
$$\sum_{i=1}^{\infty}|U_i|^t \ \le\ \delta^{t-s}\sum_{i=1}^{\infty}|U_i|^s,$$
which implies $H^t(F) = 0$ by taking the infimum on both sides and letting $\delta \to 0$. Therefore, the set $\{s \ge 0 : 0 < H^s(F) < \infty\}$ contains at most one point, which further implies $\inf\{s \ge 0 : H^s(F) = 0\} = \sup\{s \ge 0 : H^s(F) = \infty\}$. More details on the well-posedness of the Hausdorff dimension can be found in [3, 8]. In particular, one can verify that the Hausdorff dimension coincides with the standard dimension (i.e., $s \in \mathbb{N}$) when $F$ is a regular manifold. Typically, the Hausdorff dimension of a fractal is not an integer, and we exploit this fact throughout the section. A famous example is the Weierstrass function shown in Figure 7. A comparison of Figure 2e and Figure 7c (they have the same scale) gives some sense of how non-smooth the objective function can be in practice.

Figure 7: (a) The double sector $S(x, \phi, \psi)$. (b) The Weierstrass function $W(x) = \sum_{n=0}^{\infty} a^n\cos(b^n\pi x)$ with $a = 0.6$, $b = 7$, plotted over $x \in [-2, 2]$. (c) Magnified view over $x \in [0.5, 0.5001]$. The Hausdorff dimension of its graph is $\dim_H \mathcal{L}_W = 2 + \log_b a \approx 1.73$. According to [12], $W(x)$ is nowhere differentiable when $0 < a < 1$ and $ab \ge 1$.
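A small sketch for reproducing the Weierstrass curve of Figure 7b-7c and the dimension formula quoted in the caption; the truncation length is an illustrative choice justified by the geometric tail bound.

```python
import numpy as np

def weierstrass(x, a=0.6, b=7.0, n_terms=50):
    """Partial sum of W(x) = sum_{n>=0} a^n cos(b^n pi x);
    the neglected tail is at most a**n_terms / (1 - a) ~ 2e-11 here."""
    x = np.asarray(x, dtype=float)
    n = np.arange(n_terms, dtype=float)
    return np.sum((a ** n)[:, None] * np.cos((b ** n)[:, None] * np.pi * x[None, :]),
                  axis=0)

a, b = 0.6, 7.0
dim_graph = 2 + np.log(a) / np.log(b)                    # 2 + log_b(a) ~ 1.737
W_coarse = weierstrass(np.linspace(-2.0, 2.0, 20001))    # Figure 7b range
W_fine = weierstrass(np.linspace(0.5, 0.5001, 20001))    # Figure 7c range
```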
### C.2 Non-existence of tangent planes

When $J(\cdot)$ is Lipschitz continuous on compact subsets of $\mathbb{R}^N$, Rademacher's theorem implies that it is differentiable almost everywhere, which in turn implies the existence of tangent planes almost everywhere. For fractal landscapes, however, the tangent plane itself fails to exist at almost every point, which makes all policy gradient algorithms ill-posed. Although similar results hold in higher dimensions, as in [26], we focus on the two-dimensional case, which provides a more direct geometric intuition. First, we introduce the notion of s-sets:

**Definition C.1.** Let $F \subseteq \mathbb{R}^2$ be a Borel set and $s \ge 0$; then $F$ is called an s-set if $0 < H^s(F) < \infty$.

The intuition is the following: when the dimension of the fractal $F$ is a fraction between 1 and 2, there is no direction along which a significant part of $F$ concentrates within a small double sector with vertex $x$, as shown in Figure 7a. To be precise, let $S(x, \phi, \psi)$ denote the double sector with vertex $x$, direction $\phi$ and opening angle $\psi$, and let $r > 0$. We say that $F$ has a tangent at $x \in F$ in direction $\phi$ if

1. $\limsup_{r\to 0} \dfrac{H^s(F\cap B(x,r))}{r^s} > 0$, and
2. for every angle $\psi > 0$, $\ \lim_{r\to 0} \dfrac{H^s\big(F\cap(B(x,r)\setminus S(x,\phi,\psi))\big)}{r^s} = 0$,

where the first condition states that the set $F$ behaves like a fractal around $x$, and the second condition requires that the part of $F$ lying outside any double sector $S(x, \phi, \psi)$ be negligible as $r \to 0$. The main result is as follows:

**Proposition C.1.** (Non-existence of tangents, [8]) If $F \subseteq \mathbb{R}^2$ is an s-set with $1 < s < 2$, then at almost all points of $F$ no tangent exists.

Therefore, "estimating the gradient" no longer makes sense, since there is no tangent line or plane at almost every point of the loss surface: all policy gradient algorithms are ill-posed because there is no gradient for them to estimate in the first place.

### C.3 Accumulated uncertainty

Another issue that may emerge during training is the accumulation of uncertainty. To see how the uncertainty incurred at each step accumulates and eventually blows up when generating a path along a fractal boundary, consider the following toy problem. Suppose that the distance between the initial point $\theta_0 \in \mathbb{R}^N$ and the target $\theta^*$ is $d > 0$, and that a step size $\delta_k > 0$ is used at the k-th step, as shown in Figure 8a. Suppose there exists $c > 0$ such that the projection satisfies $\langle\theta^* - \theta_0,\ \theta_{k+1} - \theta_k\rangle \ge c\,d\,\delta_k$ for all $k \in \mathbb{N}$, i.e., the angle between the direction from $\theta_k$ to $\theta_{k+1}$ and the true direction $\theta^* - \theta_0$ never exceeds $\arccos(c)$. In this case, a successful path $\{\theta_k\}$ that converges to $\theta^*$ satisfies
$$\sum_{k=0}^{\infty}\langle\theta^* - \theta_0,\ \theta_{k+1} - \theta_k\rangle = \langle\theta^* - \theta_0,\ \theta^* - \theta_0\rangle = d^2$$
(using $\theta_k \to \theta^*$ as $k \to \infty$), which implies $\sum_{k=0}^{\infty}\delta_k \le d/c$.

Figure 8: Illustrations of the statistical challenges in implementing policy gradient algorithms on a fractal loss surface: (a) generating a path from $\theta_0$ to $\theta^*$; (b) updating $\theta_{n+1}$ from $\theta_n$.

On the other hand, when walking on the loss surface, there is no guarantee of following the correct direction precisely at every step. For any small step size $\delta > 0$, the uncertainty fraction $u(\delta)$ involved in a single step can be estimated by the following result [20]:
**Proposition C.2.** Let $\delta > 0$ be the step size and $\beta = N + 1 - \dim_H \mathcal{L}_J$, where $\dim_H \mathcal{L}_J$ is the Hausdorff dimension of the loss surface of $J(\cdot)$. Then the uncertainty satisfies $u(\delta) \sim O(\delta^{\beta})$ when $\delta \ll 1$.

Therefore, we may assume that there exists another constant $c' > 0$ such that the uncertainty $U_k$ incurred at the k-th step satisfies $U_k \le c'\,\delta_k^{\beta}$ for all $k = 0, 1, \dots$. The accumulated uncertainty is then bounded when $\beta = 1$ (i.e., the boundary is smooth), by the earlier bound $\sum_{k=0}^{\infty}\delta_k \le d/c$. However, the convergence of $\sum_{k=0}^{\infty}\delta_k$ no longer guarantees the convergence of $\sum_{k=0}^{\infty}\delta_k^{\beta}$ when $\beta < 1$; a counterexample is the sequence
$$\delta_k = \frac{1}{(k+1)\,\log^2(k+2)}, \qquad k = 0, 1, \dots,$$
for which $\sum_k \delta_k$ converges while $\sum_k \delta_k^{\beta}$ diverges for any $\beta < 1$. This means that the uncertainty accumulated over the course of the iterations may grow without bound and eventually cause the sequence $\{\theta_k\}$ to behave randomly when walking along fractal boundaries.
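A quick numerical illustration of the counterexample above, using the sequence $\delta_k = 1/((k+1)\log^2(k+2))$ as reconstructed here (the index shift is an assumption of this note): the accumulated step length stays bounded, while the accumulated uncertainty proxy keeps growing for $\beta < 1$.

```python
import numpy as np

def accumulated(beta, K=1_000_000):
    """Partial sums of sum_k delta_k^beta with delta_k = 1/((k+1) log^2(k+2))."""
    k = np.arange(K, dtype=float)
    delta = 1.0 / ((k + 1.0) * np.log(k + 2.0) ** 2)
    return np.cumsum(delta ** beta)

steps = accumulated(1.0)         # converges: total step length stays bounded
uncertainty = accumulated(0.5)   # beta < 1: keeps growing without bound
print(steps[-1], uncertainty[-1])
```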