# Escaping the Gravitational Pull of Softmax

Jincheng Mei (1,4), Chenjun Xiao (1,4), Bo Dai (4), Lihong Li (3), Csaba Szepesvári (2,1), Dale Schuurmans (4,1)
1: University of Alberta, 2: DeepMind, 3: Amazon, 4: Google Research, Brain Team
{jmei2,chenjun,daes}@ualberta.ca, {bodai,szepi}@google.com, llh@amazon.com
(Work done while Lihong Li was with Google. Correspondence to: Jincheng Mei. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.)

Abstract. The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, this transform poses serious drawbacks for gradient descent (ascent) optimization. We reveal this difficulty by establishing two negative results: (1) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization (softmax gravity well), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence (softmax damping). Both findings are based on an analysis of convergence rates using the non-uniform Łojasiewicz (NŁ) inequalities. To circumvent these shortcomings we investigate an alternative transformation, the escort mapping, that demonstrates better optimization properties. The disadvantages of the softmax and the effectiveness of the escort transformation are further explained using the concept of the NŁ coefficient. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.

1 Introduction

The probability transformation plays an essential role in machine learning, used whenever the output of a learned model needs to be mapped to a probability distribution. For example, in reinforcement learning (RL), a probability transformation is used to parameterize policy representations that provide a conditional distribution over a finite set of actions given an input state or observation [16]. In supervised learning (SL), particularly classification, a probability transformation is used to parameterize classifiers that provide a conditional distribution over a finite set of classes given an input observation [7]. Attention models [19] also use probability transformations to provide differentiable forms of memory addressing.

Among the myriad ways one might map vectors to probability distributions, the softmax transform is the most common. For $\theta \in \mathbb{R}^K$, the transformation $\pi_\theta = \mathrm{softmax}(\theta)$ is defined by $\pi_\theta(a) = \exp\{\theta(a)\} / \sum_{a'} \exp\{\theta(a')\}$ for all $a \in \{1, \dots, K\}$, which ensures $\pi_\theta(a) > 0$ and $\sum_a \pi_\theta(a) = 1$ [4]. The softmax transform can also be extended to continuous output spaces through the concept of a Gibbs function [11], but for concreteness we restrict attention to finite output sets.

Despite the ubiquity of the softmax in machine learning, it is not clear why it should be the default choice of probability transformation. Some alternative transformations have been investigated in the literature [6, 10], but a comprehensive understanding of why one choice might be advantageous over another remains incomplete. It is natural to ask what options might be available and what properties are desirable. In fact, we find that the softmax is a particularly undesirable choice from the perspective of gradient descent (ascent) optimization. Moreover, better alternatives are readily available at no computational overhead.
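For concreteness, here is a minimal NumPy sketch of the softmax transform defined above; the shift-by-max step is a standard numerical safeguard and the function name is our own, not part of the paper.

```python
import numpy as np

def softmax(theta):
    """pi_theta(a) = exp(theta(a)) / sum_a' exp(theta(a')); shifted by max(theta) for numerical stability."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

theta = np.array([1.0, 2.0, 0.5])
pi = softmax(theta)
assert np.all(pi > 0) and np.isclose(pi.sum(), 1.0)
```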
This paper seeks to fill the gap in understanding key properties of probability transformations in general and how they compare to the softmax. We start by considering reinforcement learning and investigate gradient ascent optimization of expected reward using the softmax transform, an algorithm we refer to as softmax policy gradient (SPG) [1, 12]. In this setting, we identify an inherent disadvantage of SPG, the softmax gravity well (SGW), whereby gradient ascent trajectories are drawn toward suboptimal corners of the probability simplex and subsequently slowed in their progress toward the optimal vertex. We establish these facts both through theoretical analysis and empirical observation, revealing that the behavior of SPG depends strongly on initialization. We then propose the escort transform as an alternative to the softmax for expected reward optimization. We analyze the resulting gradient ascent algorithm, escort policy gradient (EPG), and prove that it enjoys strictly better convergence behavior than SPG, significantly mitigating sensitivity to initialization. These findings are also verified experimentally.

Next we consider supervised learning and investigate gradient descent optimization of the cross entropy loss using the softmax transform, an algorithm we refer to as softmax cross entropy (SCE). Here, even though the optimization landscape at the output layer is convex, we identify a detrimental phenomenon we refer to as softmax damping. In particular, given deterministic ("one-hot") true label distributions, we show that SCE achieves a slower than linear rate of convergence. We then propose the escort transform as an alternative to the softmax for cross entropy minimization. We analyze the resulting gradient descent algorithm, escort cross entropy (ECE), and show that it is guaranteed to enjoy strictly faster convergence than SCE. In particular, a special choice of the escort transform fully eliminates the softmax damping phenomenon, preserving the linear convergence rate for cross entropy minimization.

Finally we propose a unifying concept, the non-uniform Łojasiewicz (NŁ) coefficient, to explain both the softmax gravity well and softmax damping, even when these might otherwise appear to be disconnected phenomena. We show that by increasing the NŁ coefficient, EPG achieves strictly better initialization dependence than SPG. Moreover, by making the NŁ coefficient non-vanishing, ECE enjoys strictly faster convergence than SCE.

2 Illustrating Softmax Gravity Wells with Softmax Policy Gradient

We begin by considering the domain of reinforcement learning (RL), where the goal is to learn a policy that maximizes expected return. A core method in RL is policy gradient [17], where a parameterized policy is directly optimized to maximize long-term expected reward. It is conventional in this area to represent parametric policies using a softmax transform to produce conditional action distributions, hence policy gradient in practice is almost always softmax policy gradient (SPG). Despite the fact that SPG has been a dominant RL method for decades, only recently has it been proved to be globally convergent for general MDPs [1]. This result is far from obvious, since the optimization objective is not concave; nevertheless it was shown that SPG converges to the optimal policy under general conditions [1]. More recently, this result was strengthened to establish a $\Theta(1/t)$ bound on the rate of convergence [12], with constants that depend on the problem and initialization.
Although these theoretical results are general and impressive, they seem at odds with the behavior of policy gradient methods, which are notoriously difficult to tune in practice [15]. To reconcile theory with empirical observation, we first demonstrate that the constants in these results are in fact important, and understanding their role explains much of the real-world performance of SPG.

Illustration. To illustrate the point concretely, consider a simple experiment on a single-state Markov Decision Process (MDP) (i.e., a multi-armed bandit) with $K = 6$ actions. In this case, the SPG update of a policy $\pi_\theta$ for a given reward vector $r \in [0, 1]^K$ reduces to
$$\theta_{t+1}(a) = \theta_t(a) + \eta\, \pi_{\theta_t}(a) \left( r(a) - \pi_{\theta_t}^\top r \right), \quad \forall a \in [K] := \{1, \dots, K\},$$
with $\pi_{\theta_{t+1}} = \mathrm{softmax}(\theta_{t+1})$. Fig. 1 shows the result of multiple runs using SPG with full gradients. Depending on whether the last iterate satisfies $\pi_{\theta_T}(a^*) \ge 0.99$, we group the 20 runs into "good" and "bad" initializations. As shown in Fig. 1(a) and (b), for good initializations the sub-optimality $(\pi^* - \pi_{\theta_t})^\top r$ quickly approaches 0, whereas for bad initializations the iterates get stuck near local optima. Subfigure (c) plots the average probability of the optimal action, revealing that trajectories from bad initializations stay near local optima, since the optimal action probabilities stay close to 0. However, we know from the theory that SPG must eventually converge to the optimal policy $\pi^*$ from any initialization, and that is indeed the case here: subfigure (d) shows the long-run time to convergence (boxes are 25th to 75th percentiles) for good versus bad initializations, where the y-axis is $\log T$ such that $\pi_{\theta_T}(a^*) \ge 0.99$, showing that bad runs take many orders of magnitude longer.

Figure 1: SPG behavior on single-state MDPs with $K = 6$ arms, fully parameterized policy (no approximation error), rewards randomly generated (uniform within $[0, 1]$ for each $r(a)$) and policy randomly initialized on each run, 20 runs. Full gradient SPG updates with stepsize $\eta = 0.4$ [12] for $T = 3 \times 10^4$ steps. An initialization is "good" if $\pi_{\theta_T}(a^*) \ge 0.99$ at the last iterate.

Although these findings seem not to comport with theory, they can in fact be explained by delving deeper into the detailed nature of the $\Theta(1/t)$ rates proved in [12].

Escape time. To control the effect of initialization, consider a specialization of the previous problem where we let $r = (b + \Delta, b, \dots, b)^\top \in [0, 1]^K$ for some $b$, such that $\Delta > 0$ is the reward gap. For a given initialization, we say that SPG escapes at time $t_0$ if for all $t \ge t_0$ it holds that $(\pi^* - \pi_{\theta_t})^\top r < 0.9\,\Delta$, i.e., after $t_0$ the sub-optimality stays small. Fig. 2(a) shows that as the initial probability of the optimal action $\pi_{\theta_1}(a^*)$ decreases, the escape time $t_0$ increases proportionally. In particular, the slope in subfigure (a) approaches $-1$ as $\pi_{\theta_1}(a^*)$ decreases, indicating that $\log t_0 = -\log \pi_{\theta_1}(a^*) + C$, or equivalently $t_0 = C' / \pi_{\theta_1}(a^*)$. Two trajectories for SPG on a single-state MDP with $K = 5$ are shown in Fig. 2(b) and (c). This example reveals that every suboptimal vertex $i \in \{2, 3, 4\}$ has the potential to attract the iterates, while also slowing progress enough to produce the sub-optimality plateaus in subfigure (c). Therefore, SPG spends some escape time around each suboptimal corner.

Figure 2: Dependence on initialization and softmax gravity wells.
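To make the setup above concrete, the following is a minimal sketch of the full-gradient SPG update on a single-state MDP, under our own assumptions (a random Gaussian $\theta_1$, helper names of our choosing); it illustrates the update rule, not the authors' experimental code.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def spg_bandit(r, theta0, eta=0.4, T=30_000):
    """Full-gradient softmax policy gradient on a K-armed bandit:
    theta_{t+1}(a) = theta_t(a) + eta * pi_t(a) * (r(a) - pi_t^T r)."""
    theta = theta0.astype(float).copy()
    subopt = np.empty(T)
    for t in range(T):
        pi = softmax(theta)
        theta += eta * pi * (r - pi @ r)
        subopt[t] = r.max() - softmax(theta) @ r   # (pi* - pi_theta)^T r
    return subopt

rng = np.random.default_rng(0)
K = 6
r = rng.uniform(0.0, 1.0, size=K)        # random rewards in [0, 1]
gaps = spg_bandit(r, rng.normal(size=K))
print(gaps[-1])  # may still be far from 0 for unlucky initializations (the gravity well effect)
```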
As SPG follows a trajectory defined by exact gradients, it effectively encounters softmax gravity wells (SGWs) at the vertices (deterministic policies), each of which attracts the trajectory and significantly slows down progress in its vicinity. To see why the attraction to suboptimal vertices is possible, consider the SPG in detail: for a single-state MDP and $a \in [K]$, we have
$$\frac{d\, \pi_\theta^\top r}{d\, \theta(a)} = \pi_\theta(a) \left( r(a) - \pi_\theta^\top r \right). \tag{1}$$
Note that it is possible for an optimal action, say $a_1$, to be less attractive than a suboptimal action $a_2$, even when $r(a_1) > r(a_2)$: it is possible to have both $r(a_1) - \pi_{\theta_t}^\top r > r(a_2) - \pi_{\theta_t}^\top r > 0$ and $\pi_{\theta_t}(a_2) > \pi_{\theta_t}(a_1)$, and yet still have $\pi_{\theta_t}(a_2) \left( r(a_2) - \pi_{\theta_t}^\top r \right) > \pi_{\theta_t}(a_1) \left( r(a_1) - \pi_{\theta_t}^\top r \right)$. This configuration causes the probability of the suboptimal action to stay above the optimal action probability, $\pi_{\theta_{t+1}}(a_2) > \pi_{\theta_{t+1}}(a_1)$.

Even though the examples and analysis above might seem specific, they provide the foundation for a useful and informative lower bound.

Theorem 1 (Escape time lower bound). Even in a single-state MDP, for any learning rate $\eta_t \in (0, 1]$, there exists an initialization of the policy $\pi_{\theta_1}$ and a positive constant $C$ such that SPG with full gradients cannot escape a suboptimal corner before time $t_0 := C / \pi_{\theta_1}(a^*)$, i.e., it will hold that
$$(\pi^* - \pi_{\theta_t})^\top r \ge 0.9\,\Delta \tag{2}$$
for all $t \le t_0$, where $\Delta := r(a^*) - \max_{a \ne a^*} r(a) > 0$ is the reward gap of $r \in [0, 1]^K$.

Theorem 1 shows that for SPG with bounded learning rates (needed for monotonic improvement [1, 12]), the time to escape suboptimal vertices is lower bounded by a quantity inversely proportional to the optimal action probability $\pi_\theta(a^*)$, which is necessarily small near suboptimal vertices, leading to long suboptimal plateaus.

Failure of SPG heuristics. Given the insight from Theorem 1, one might wonder if simple heuristics can compensate for the slow progress of SPG, for example by using large learning rates or normalizing the policy gradient. We show in the appendix that these heuristics unfortunately do not work well even in simple bandit problems.

Existing observations of plateaus. SPG plateaus have been observed in the literature. Previous work [12] did observe this effect empirically, but did not take a deeper look into the underlying causes. With function approximation, feature interference has also been considered to be a source of plateaus [14]. In the multi-agent setting, it has been observed that the non-stationary nature of the environment can also make it difficult for SPG to adapt [8]. However, the analysis in this paper shows that SPG still suffers from plateaus even in the simplest setting (exact gradients, no approximation, stationary environments). In Section 4 we provide additional mathematical insight to explain why the softmax transformation itself is the root cause, which also justifies the name SGW.

3 Escort Transform for Policy Gradient

As explained, a difficulty encountered by SPG comes from the $\pi_\theta(a)$ factor that appears in the gradient, Eq. (1). This creates a dependence on the current policy that potentially discounts the signal from high-reward actions. Unfortunately, the problem is unavoidable if using SPG with bounded learning rates to perform updates (Theorem 1). Therefore, we study the following alternative transform, which we refer to as the escort transform [3, 18].

Definition 1 (Escort transform). Given $\theta: S \times A \to \mathbb{R}$, define $\pi_\theta = f_p(\theta)$ for $p \ge 1$ by
$$\pi_\theta(a|s) = \frac{|\theta(s, a)|^p}{\sum_{a' \in A} |\theta(s, a')|^p}, \quad \text{for all } (s, a) \in S \times A. \tag{3}$$
If there is only one state, the escort transform is defined as $\pi_\theta(a) = |\theta(a)|^p / \sum_{a'} |\theta(a')|^p$ for all $a \in [K]$.
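As an illustration of Definition 1 in the single-state case, a minimal sketch (the function name is ours):

```python
import numpy as np

def escort(theta, p=2):
    """Escort transform f_p: pi_theta(a) = |theta(a)|^p / sum_a' |theta(a')|^p, for p >= 1."""
    w = np.abs(theta) ** p
    return w / w.sum()

theta = np.array([1.0, -2.0, 0.5])
for p in (1, 2, 4):
    pi = escort(theta, p)
    assert np.isclose(pi.sum(), 1.0) and np.all(pi >= 0)
```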
To explain why this alternative transform might help alleviate the problems encountered by the softmax, consider the gradient of expected reward under the escort transform, i.e., the escort policy gradient (EPG), for a single-state MDP and $a \in [K]$ (detailed calculations are shown in the appendix):
$$\frac{d\, \pi_\theta^\top r}{d\, \theta(a)} = \frac{p\, \mathrm{sgn}\{\theta(a)\}\, |\theta(a)|^{p-1}}{\sum_{a'} |\theta(a')|^p} \left( r(a) - \pi_\theta^\top r \right) \tag{4}$$
$$= \frac{p}{\|\theta\|_p}\, \mathrm{sgn}\{\theta(a)\}\, \pi_\theta(a)^{1 - 1/p} \left( r(a) - \pi_\theta^\top r \right). \tag{5}$$
Note the key difference between SPG and EPG: the $\pi_\theta(a)$ term in Eq. (1) becomes $\pi_\theta(a)^{1 - 1/p}$ in Eq. (5). For any $p \ge 1$ we have $1 - 1/p \in [0, 1)$, which implies $\pi_\theta(a)^{1 - 1/p} > \pi_\theta(a)$ whenever $\pi_\theta(a) \in (0, 1)$. This change has important implications for the convergence rate.

Remark 1. $\pi_\theta(a)^{1 - 1/p} \to \pi_\theta(a)$ as $p \to \infty$, which suggests that large values of $p$ lead to iteration behavior similar to SPG, whereas small values of $p$ weaken the dependence on $\pi_\theta(a)$. In particular, if $p = 1$ then $\pi_\theta(a)^{1 - 1/p} = 1$, which entirely eliminates the dependence on the current policy $\pi_\theta$.

As is the case for the softmax transform, the expected reward objective remains non-concave over the parameter $\theta$ when using the alternative escort transform.

Proposition 1. $\theta \mapsto \pi_\theta^\top r$ is a non-concave function over $\mathbb{R}^K$ using the map $\pi_\theta := f_p(\theta)$.

Despite the non-concavity, we obtain surprisingly strong convergence results for EPG, with proofs provided in the appendix. In particular, thanks to what we call non-uniform smoothness and the non-uniform Łojasiewicz (NŁ) inequality enjoyed by the objective, EPG is shown to enjoy an upper bound on the sub-optimality for single-state MDPs with a strictly better initialization dependence than SPG.

Theorem 2. For a single-state MDP, following the escort policy gradient with any initialization such that $|\theta_1(a)| > 0$ for all $a$, we obtain the following upper bounds on the sub-optimality for all $t \ge 1$:
(gradient ascent) for $p \ge 2$ and $p = 1$, with $\eta_t = \|\theta_t\|_p^2 / (9\, p^2\, K^{1/p})$,
$$(\pi^* - \pi_{\theta_t})^\top r \le \frac{9\, K^{1/p}\, p^2}{c^{2 - 2/p}\; t};$$
(gradient flow) for $p \ge 1$, with $\eta_t = \|\theta_t\|_p^2 / p^2$,
$$(\pi^* - \pi_{\theta_t})^\top r \le \frac{1}{c^{2 - 2/p}\,(t - 1) + 1},$$
where $c := \inf_{t \ge 1} \pi_{\theta_t}(a^*) > 0$ depends on the problem and initialization, but is time-independent. (Here, "gradient ascent" refers to the discrete update $\theta_{t+1} = \theta_t + \eta_t\, \frac{d\, \pi_{\theta_t}^\top r}{d\, \theta_t}$, and "gradient flow" refers to the continuous-time version $\frac{d\, \theta_t}{d t} = \eta_t\, \frac{d\, \pi_{\theta_t}^\top r}{d\, \theta_t}$.)

As $p \to \infty$, Theorem 2 implies an $O(1/(c^2 t))$ convergence rate, recovering the same rate as SPG [12], as expected (Remark 1). For $p < \infty$, EPG achieves the same $O(1/t)$ rate as SPG, but enjoys a strictly better $c^{2 - 2/p} > c^2$ dependence. In particular, for $p = 1$ there is no dependence on $c$ at all, which is also consistent with Remark 1. On the other hand, $K^{1/p} \le K$ increases as $p$ decreases, which means it is not always good to choose small $p$ values, due to a trade-off between $K$ and $c$.
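To illustrate the single-state EPG update in Eq. (4) before moving to general MDPs, here is a minimal sketch; for simplicity it uses a fixed stepsize rather than the adaptive, norm-dependent stepsize of Theorem 2, and the helper names are ours.

```python
import numpy as np

def escort(theta, p):
    w = np.abs(theta) ** p
    return w / w.sum()

def epg_bandit(r, theta0, p=2, eta=0.1, T=5_000):
    """Full-gradient escort policy gradient on a K-armed bandit, Eq. (4):
    d(pi^T r)/d theta(a) = p * sgn(theta(a)) * |theta(a)|^{p-1} / sum_a' |theta(a')|^p * (r(a) - pi^T r)."""
    theta = theta0.astype(float).copy()
    for _ in range(T):
        pi = escort(theta, p)
        grad = p * np.sign(theta) * np.abs(theta) ** (p - 1) / np.sum(np.abs(theta) ** p) * (r - pi @ r)
        theta += eta * grad
    return escort(theta, p)

r = np.array([0.2, 0.9, 1.0])
for p in (1, 2, 4):
    pi = epg_bandit(r, np.ones(3), p=p)
    print(p, r.max() - pi @ r)   # sub-optimality shrinks; smaller p depends less on the current policy
```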
Similar results can in fact be obtained for EPG in general finite MDPs, denoted $M = (S, A, r, P, \gamma)$, where $S$ and $A$ are finite state and action spaces, $r: S \times A \to \mathbb{R}$ is the reward function, $P: S \times A \to \Delta(S)$ is the transition function, $\Delta(X)$ denotes the set of probability distributions over a finite set $X$, and $\gamma \in [0, 1)$ is the discount factor. Let $V^\pi(\rho) := \mathbb{E}_{s_0 \sim \rho(\cdot),\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right]$ denote the expected return (value function) achieved by policy $\pi$, where $\rho \in \Delta(S)$ is an initial state distribution. The goal is to maximize the value function, i.e., to find a policy $\pi^*$ that attains the value $V^*(\rho) := \max_{\pi: S \to \Delta(A)} V^\pi(\rho)$.

Theorem 3. Following the escort policy gradient with any initialization such that $|\theta_1(s, a)| > 0$ for all $(s, a)$, and $\eta_t(s) = \frac{(1 - \gamma)^3\, \|\theta_t(s, \cdot)\|_p^2}{10\, p^2\, A^{2/p}}$ to get $\{\theta_t\}_{t \ge 1}$, the following sub-optimality upper bound holds for $\pi_{\theta_t}$, for all $t \ge 1$, for $p \ge 2$ and $p = 1$:
$$V^*(\rho) - V^{\pi_{\theta_t}}(\rho) \le \frac{20\, A^{2/p}\, S}{c^{2 - 2/p}\, (1 - \gamma)^6\, t},$$
where $c := \inf_{s \in S} \inf_{t \ge 1} \pi_{\theta_t}(a^*(s)|s) > 0$ is a problem- and initialization-dependent constant, $A := |A|$ and $S := |S|$ are the total numbers of actions and states, respectively, and $\mu \in \Delta(S)$ is the initial state distribution that provides initial states for the policy gradient method.

Remark 2. Using $p = 1$ in Theorem 3, the iteration complexity of EPG depends on polynomial functions of $S$ and $A$, which significantly improves the corresponding results for SPG [12, Theorem 4], where the worst-case dependence can be exponential in $S$ and $A$.

Finally, as for SPG, adding entropy regularization leads to linear convergence rates for EPG. Note that SPG with entropy regularization enjoys a linear convergence rate $O(1/\exp\{c^2 t\})$ with dependence on $c = \inf_{t \ge 1} \min_{(s, a)} \pi_{\theta_t}(a|s)$ [12]. Our results show that EPG with entropy regularization has strictly better dependence than SPG.

Theorem 4. For an entropy regularized MDP with finite states and actions, following the escort policy gradient with any initialization such that $|\theta_1(s, a)| > 0$ for all $(s, a)$, and $\eta_t = (1 - \gamma)^3 / (10\, p^2\, A^{1/p} + c_\tau)$ to get $\{\theta_t\}_{t \ge 1}$, the following sub-optimality upper bound holds for $\pi_{\theta_t}$, for all $t \ge 1$ and $p \ge 2$:
$$V^{\pi^*_\tau}(\rho) - V^{\pi_{\theta_t}}(\rho) \le \frac{\left\| 1/\mu \right\|_\infty}{\exp\{C_\tau\, \tilde{c}^2\, t\}} \cdot \frac{1 + \tau \log A}{(1 - \gamma)^2}, \tag{6}$$
where $\tilde{c} > c := \inf_{(s, a)} \inf_{t \ge 1} \pi_{\theta_t}(a|s) > 0$, $\tau$ is the temperature of the entropy regularization, $\pi^*_\tau$ is the softmax optimal policy, and $c_\tau$, $C_\tau$ are problem-dependent constants.

Relationship to Mirror Descent (MD). As an additional observation, note that simply removing $\pi_\theta(a)$ in Eq. (1) yields the update $\theta_{t+1}(a) = \theta_t(a) + \eta_t\, r(a)$ with $\pi_{\theta_{t+1}} = \mathrm{softmax}(\theta_{t+1})$, which can be combined into the update $\pi_{\theta_{t+1}}(a) \propto \pi_{\theta_t}(a) \exp\{\eta_t\, r(a)\}$, equivalent to Mirror Descent (MD) with KL regularization. Given this similarity between SPG, EPG and MD, one might hope that EPG could be reduced to a particular version of MD. However, unlike SPG and MD, the EPG gradient does not specify a conservative vector field and cannot be recovered by MD using any regularization.

Remark 3 (EPG cannot be reduced to MD). Recall that for a (convex) potential $\Phi: \Delta \to \mathbb{R}$ and its Bregman divergence $D_\Phi: \Delta \times \Delta \to \mathbb{R}$, the MD update is $\pi_{t+1} = \arg\max_{\pi \in \Delta}\; \pi^\top r - (1/\eta_t)\, D_\Phi(\pi \| \pi_t)$. In particular, using the negative entropy $\Phi(\pi) = \sum_a \pi(a) \log \pi(a)$ as the potential and $D_\Phi(\pi \| \pi') = D_{\mathrm{KL}}(\pi \| \pi')$ as the divergence, one obtains $\pi_{\theta_{t+1}}(a) \propto \pi_{\theta_t}(a) \exp\{\eta_t\, r(a)\}$. Equivalently, this update can be expressed as $\pi_{\theta_{t+1}} = \arg\max_{\pi \in \Delta}\; \pi^\top \theta_{t+1} - \Phi(\pi)$, where $\theta_{t+1}(a) = \theta_t(a) + \eta_t\, r(a)$. Now suppose EPG is MD, i.e., there is some $\Phi$ such that $f_p(\theta_{t+1}) = \arg\max_{\pi \in \Delta}\; \pi^\top \theta_{t+1} - \Phi(\pi)$. Then we would have $f_p(\theta_{t+1}) = \nabla \Phi^*(\theta_{t+1})$, where $\Phi^*$ is the Fenchel conjugate of $\Phi$. Taking the derivative w.r.t. $\theta$ yields
$$\frac{d\, f_p(\theta)}{d\, \theta} = p\, \mathrm{diag}(1/\theta) \left( \mathrm{diag}(\pi_\theta) - \pi_\theta \pi_\theta^\top \right) \overset{?}{=} \frac{d^2 \Phi^*(\theta)}{d\, \theta^2}.$$
By Schwarz's theorem, $\frac{d^2 \Phi^*(\theta)}{d\, \theta^2}$ must be symmetric; however, $\mathrm{diag}(1/\theta) \left( \mathrm{diag}(\pi_\theta) - \pi_\theta \pi_\theta^\top \right)$ is in general not symmetric. Therefore, there cannot be a regularizer $\Phi$ that makes EPG equivalent to MD.

Remark 3 implies that standard techniques for analyzing mirror descent (e.g., Bregman divergences and convex duality) cannot be directly applied to EPG, necessitating our analysis based on non-uniform smoothness and the NŁ inequalities for Theorems 2 to 4.
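A quick numerical check of the asymmetry argument in Remark 3 (a sketch; the variable names are ours):

```python
import numpy as np

p = 2
theta = np.array([1.0, 2.0, 3.0])
pi = np.abs(theta) ** p
pi /= pi.sum()

# Candidate Hessian of the conjugate potential: p * diag(1/theta) * (diag(pi) - pi pi^T)
M = p * np.diag(1.0 / theta) @ (np.diag(pi) - np.outer(pi, pi))
print(np.allclose(M, M.T))  # False: not symmetric, so it cannot equal d^2 Phi*(theta) / d theta^2
```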
Experimental verification. To support these findings and reveal some of the practical implications of EPG versus SPG, we conducted a simple experiment on a single-state MDP with $K = 3$ and $r = (0.2, 0.9, 1.0)^\top$. Fig. 3(a) depicts the values of $d\pi_\theta(a^*)/dt$ for SPG, where the dark regions around the corners show areas of slow progress. In particular, the region around the lower-right suboptimal corner exhibits $d\pi_\theta(a^*)/dt < 0$, so $\pi_\theta(a^*)$ will actually decrease under SPG updating in this region, prolonging the escape time according to Theorem 1. In short, the dark regions correspond to SGWs for SPG. Subfigure (b) further shows how SPG is attracted toward the suboptimal corner, visually consistent with subfigure (a). By contrast, the solid lines indicate EPG methods with different $p$ values. As noted in Remark 1, smaller $p$ values have better resistance against attraction to softmax gravity wells, while larger $p$ values behave more similarly to SPG. We also observe that MD (with KL regularization) has similar performance to EPG with $p = 2$ in this case. Finally, subfigure (c) plots the sub-optimality gap before $(\pi^* - \pi_{\theta_t})^\top r \le 0.005$ is achieved. It is clear that SPG gets stuck on a suboptimal plateau while the EPG methods do not suffer from this disadvantage. We note that the EPG curves for $p \ge 2$ behave more smoothly than $p = 1$, since the escort transform is differentiable when $p \ge 2$.

Figure 3: Empirical visualization for EPG and SPG.

4 Non-uniform Łojasiewicz Coefficient: An Underlying Explanation

Remark 1 provides an intuition for why EPG has better initialization dependence than SPG. This intuition can be formalized using the notion of the non-uniform Łojasiewicz (NŁ) coefficient, which plays an important role here since both the SPG [12] and EPG analyses are based on NŁ inequalities.

Definition 2 (Non-uniform Łojasiewicz (NŁ) coefficient). A function $f: X \to \mathbb{R}$ has NŁ coefficient $C(x) > 0$ if it satisfies the NŁ inequality with coefficient $C(x)$, i.e., there exists $\xi \in [0, 1]$ such that for all $x \in X$,
$$\left\| \nabla f(x) \right\|_2 \ge C(x)\, |f(x) - f(x^*)|^{1 - \xi}. \tag{7}$$

In Definition 2, $\xi$ is called the NŁ degree, which impacts the convergence rates of SPG methods [12, Definition 1]. From a result in [12], if $\pi_\theta = \mathrm{softmax}(\theta)$, then $\pi_\theta^\top r$ has NŁ coefficient $\pi_\theta(a^*)$; that is,
$$\left\| \frac{d\, \pi_\theta^\top r}{d\, \theta} \right\|_2 = \left\| \left( \mathrm{diag}(\pi_\theta) - \pi_\theta \pi_\theta^\top \right) r \right\|_2 \ge \pi_\theta(a^*) \cdot (\pi^* - \pi_\theta)^\top r. \tag{8}$$
Moreover, this coefficient is not improvable, and it appears in the SPG convergence rate $O(1/(c^2 t))$, where $c := \inf_{t \ge 1} \pi_{\theta_t}(a^*)$. Now consider EPG. If $\pi_\theta = f_p(\theta)$, then we have
$$\left\| \frac{d\, \pi_\theta^\top r}{d\, \theta} \right\|_2 = \left\| p\, \mathrm{diag}(1/\theta) \left( \mathrm{diag}(\pi_\theta) - \pi_\theta \pi_\theta^\top \right) r \right\|_2 \ge p\, \frac{\pi_\theta(a^*)}{|\theta(a^*)|}\, (\pi^* - \pi_\theta)^\top r. \tag{9}$$
Since $\frac{\pi_\theta(a^*)}{|\theta(a^*)|} = \frac{1}{\|\theta\|_p}\, \pi_\theta(a^*)^{1 - 1/p}$, and $\pi_\theta(a^*)^{1 - 1/p} > \pi_\theta(a^*)$, the escort transform provides a strictly larger (partial) NŁ coefficient; hence in Theorem 2 EPG obtains a strictly better result than SPG. The improvement in the NŁ coefficient explains the better dependence of EPG on initialization. It is then natural to ask whether the escort transform can also benefit other scenarios, which is answered affirmatively in the next section.
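The two NŁ inequalities above can be checked numerically on random instances; the following sketch (with our own variable names) verifies Eqs. (8) and (9) in the single-state case.

```python
import numpy as np

rng = np.random.default_rng(1)
K, p = 6, 2
r = rng.uniform(size=K)
a_star = int(np.argmax(r))           # (pi* - pi)^T r equals r(a*) - pi^T r for the optimal one-hot pi*

# Softmax NL inequality, Eq. (8)
theta = rng.normal(size=K)
pi = np.exp(theta - theta.max()); pi /= pi.sum()
grad = (np.diag(pi) - np.outer(pi, pi)) @ r
assert np.linalg.norm(grad) >= pi[a_star] * (r[a_star] - pi @ r)

# Escort NL inequality, Eq. (9); the coefficient equals (p / ||theta||_p) * pi(a*)^{1 - 1/p}
theta = rng.normal(size=K)
pi = np.abs(theta) ** p; pi /= pi.sum()
grad = p * np.diag(1.0 / theta) @ (np.diag(pi) - np.outer(pi, pi)) @ r
assert np.linalg.norm(grad) >= p * pi[a_star] / abs(theta[a_star]) * (r[a_star] - pi @ r)
```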
5 Escort Transform for Cross Entropy

We now turn to classification, where the goal is to learn a classifier that minimizes the cross entropy loss. As in RL, the softmax transform is the default choice for parameterizing a probabilistic classifier. Unlike in RL, where the objective is linear in $\pi_\theta$, the objective here involves log-probabilities:
$$\min_{\theta: A \to \mathbb{R}}\; -\log \pi_\theta(a_y) = H(y) + \min_{\theta: A \to \mathbb{R}}\; D_{\mathrm{KL}}(y \| \pi_\theta), \tag{10}$$
where $\pi_\theta = \mathrm{softmax}(\theta)$, $y \in \{0, 1\}^K$ is a one-hot vector encoding the class label, and $a_y$ is the true label class, so that $y(a_y) = 1$. Note that the entropy $H(y) = 0$ here.

The objective in Eq. (10) is smooth and convex in $\theta$, which implies that gradient descent will achieve an $O(1/t)$ rate [13]. Furthermore, for $\theta$ satisfying $\min_a \pi_\theta(a) \ge \pi_{\min}$ for some constant $\pi_{\min} > 0$ (i.e., $\pi_\theta$ is bounded away from the simplex boundary), the objective is strongly convex, resulting in an even better, linear rate $O(e^{-t})$. Despite these nice properties, we still find that the softmax transform proves problematic for gradient descent optimization. We refer to this new disadvantage as softmax damping.

Illustration. Consider running gradient descent in a simple experiment where $K = 10$ and $y$ is a one-hot vector. Let $\delta_t := -\log \pi_{\theta_t}(a_y)$. If one hopes for a linear convergence rate, i.e., $\delta_t = O(e^{-t})$, then $\log \delta_t$ should decrease linearly in $t$. But Fig. 4(a) shows a different picture, with a flattening slope. Subfigure (b) plots $\log \delta_t$ as a function of $\log t$, showing a straight line for sufficiently large $t$ with a slope approaching $-1$. This verifies that the convergence rate is indeed $\delta_t = O(1/t)$, rather than the linear $O(e^{-t})$ rate. Subfigure (c) shows that the $\ell_2$ measure $\|y - \pi_{\theta_t}\|_2^2$ exhibits a similar rate, indicating that this is an inherent optimization phenomenon, independent of the measurement.

Figure 4: Softmax damping phenomenon and escort cross entropy.

NŁ coefficient explanation. The NŁ coefficient can be used to explain why this rate degeneration occurs for softmax cross entropy (SCE). Note that for $\pi_\theta = \mathrm{softmax}(\theta)$ we obtain
$$\left\| \frac{d\, D_{\mathrm{KL}}(y \| \pi_\theta)}{d\, \theta} \right\|_2^2 = \left\| y - \pi_\theta \right\|_2^2 \ge \min_a \pi_\theta(a) \cdot D_{\mathrm{KL}}(y \| \pi_\theta). \tag{11}$$
Once again the $\min_a \pi_\theta(a)$ term cannot be eliminated for the softmax transform, but here it has a different consequence than before. To see that the NŁ coefficient of SCE cannot be improved, consider the example where $y = (0, 1)^\top$ and $\pi = (\epsilon, 1 - \epsilon)^\top$ with small $\epsilon > 0$. Note that $D_{\mathrm{KL}}(y \| \pi) = -\log(1 - \epsilon) \approx \epsilon$ and $\|y - \pi\|_2^2 = 2\,\epsilon^2$, which means that for any constant $C > 0$ and small enough $\epsilon$ we have $C \cdot D_{\mathrm{KL}}(y \| \pi) \approx C\,\epsilon > 2\,\epsilon^2 = \|y - \pi\|_2^2$. Therefore, for any Łojasiewicz-type inequality of this form, $C$ necessarily depends on $\min_a \pi_\theta(a)$. Now, for any convergent sequence $\pi_{\theta_t}$, i.e., such that $D_{\mathrm{KL}}(y \| \pi_{\theta_t}) \to 0$, we necessarily have $\min_a \pi_{\theta_t}(a) \to 0$, which makes the gradient information insufficient to sustain a linear convergence rate. That is, the fast convergence rate is damped. The difference between this phenomenon and the softmax gravity well is that here the vanishing NŁ coefficient changes the rate itself, rather than the constant in the bound on the sub-optimality gap.

Escort cross entropy. As in Section 3 for policy gradient, we propose to also use the escort transform for cross entropy minimization. A simple calculation for $\pi_\theta = f_p(\theta)$ shows
$$\left\| \frac{d\, D_{\mathrm{KL}}(y \| \pi_\theta)}{d\, \theta} \right\|_2^2 = \left\| p\, \mathrm{diag}(1/\theta)\, (y - \pi_\theta) \right\|_2^2 \ge \frac{p^2}{\|\theta\|_p^2}\, \min_a \pi_\theta(a)^{1 - 2/p} \cdot D_{\mathrm{KL}}(y \| \pi_\theta). \tag{12}$$
Note that the term $\min_a \pi_\theta(a)^{1 - 2/p} > \min_a \pi_\theta(a)$ is strictly better than for softmax cross entropy when $p \ge 2$. In particular, for $p = 2$, the escort cross entropy (ECE) has (partial) NŁ coefficient $\min_a \pi_\theta(a)^{1 - 2/p} = 1$, which fully eliminates the dependence on the current policy $\pi_\theta$. This leads to our last main result, which restores the linear convergence rate.

Theorem 5. Using the escort transform with $p = 2$ and gradient descent on the cross entropy objective with learning rate $\eta_t = \frac{\|\theta_t\|_p^2}{4\,(3 + c_1^2)}$, we obtain for all $t \ge 1$,
$$-\log \pi_{\theta_t}(a_y) = D_{\mathrm{KL}}(y \| \pi_{\theta_t}) \le D_{\mathrm{KL}}(y \| \pi_{\theta_1}) \cdot \exp\left\{ -\frac{t - 1}{2\,(3 + c_1^2)} \right\},$$
where $1/c_1^2 = \pi_{\theta_1}(a_y) \in (0, 1]$ only depends on the initialization.
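The following is a minimal sketch of ECE gradient descent with $p = 2$ on a synthetic one-hot target; it uses a fixed stepsize for simplicity rather than the adaptive stepsize of Theorem 5, and the element-wise gradient formula $p\,(\pi_\theta(a) - y(a))/\theta(a)$ is a direct calculation for the escort transform, matching the gradient norm in Eq. (12).

```python
import numpy as np

def escort(theta, p=2):
    w = np.abs(theta) ** p
    return w / w.sum()

def ece_descent(theta0, a_y, p=2, eta=0.1, T=200):
    """Gradient descent on -log pi_theta(a_y) under the escort transform.
    The gradient w.r.t. theta(a) is p * (pi_theta(a) - y(a)) / theta(a)."""
    theta = theta0.astype(float).copy()
    y = np.zeros_like(theta)
    y[a_y] = 1.0
    losses = np.empty(T)
    for t in range(T):
        pi = escort(theta, p)
        theta -= eta * p * (pi - y) / theta
        losses[t] = -np.log(escort(theta, p)[a_y])
    return losses

losses = ece_descent(np.ones(10), a_y=0)   # K = 10, uniform (positive) initialization
print(losses[::50])                        # the log-loss decays geometrically for p = 2
```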
For reference, we run gradient descent on the cross entropy objective in the same experiment as above, but with the escort transform. As shown in Fig. 4(d), $\log \delta_t$ now becomes linear in $t$, or equivalently $-\log \pi_{\theta_t}(a_y) = C\, e^{-c\, t}$, verifying the theoretical finding of Theorem 5.

6 Experimental Results

We conduct several experiments to verify the effectiveness of the proposed escort transform in policy gradient and cross entropy minimization. First, we conduct experiments on one-state MDPs with $K = 10$, $50$, and $100$, with 20 runs for each $K$ value and for each algorithm. In each run, the reward $r \in [0, 1]^K$ and $\pi_{\theta_1}$ are randomly generated. The total number of iterations is $T = 5 \times 10^4$. As shown in Fig. 5(a), EPG with $p = 2$ quickly converges to optimal policies consistently across all the $K$ values, significantly outperforming SPG.

Second, we compare the algorithms on the Four-room environment over 20 runs. There is one goal with reward 1.0 and four sub-goals ("sub-goals" meaning goals with lower rewards) with reward 0.7, as shown in Fig. 5(b). At a (sub-)goal state, the agent can step away and then step back to receive rewards. The policy is parameterized by a one-hidden-layer ReLU neural network with 64 hidden nodes. In each run, the starting position is randomly generated. We use value iteration to calculate $V^*$, and we calculate the true gradient in closed form. As shown in Fig. 5(c), SPG easily gets stuck in plateaus due to the presence of the sub-goals, while EPG with $p = 2$ quickly achieves the optimal goal.

Figure 5: Results on one-state MDPs and Four-room.

Next, we run experiments on the MNIST dataset. For each $(x, y)$, where $x \in \mathbb{R}^{784}$ is the image data and $y \in \{0, 1\}^{10}$ is the true label, the training objective is $1 - \pi_\theta(a_y|x)$, where $y(a_y) = 1$. We use policy gradient methods, since the mis-classification probability minimization problem is a special case of expected reward maximization. We use a one-hidden-layer neural network with 512 hidden nodes and ReLU activation to parameterize $\theta$. The dataset is split into a training set with 55,000, a validation set with 5,000, and a test set with 10,000 data points. As shown in Fig. 6(a) and (b), for both the training objective and the test error, SPG exhibits plateaus due to SGWs, which is consistent with the observation in [5]. At the same time, EPG with $p = 4$ does not have this disadvantage: it converges quickly and achieves better results than SPG. Experiments for other $p$ values are shown in the appendix. We use mini-batch stochastic gradient descent in this experiment, and the results show that with stochastic gradients and neural network function approximation, (1) SPG still plateaus even when starting from nearly uniform initializations; and (2) EPG outperforms SPG in that it does not suffer from plateaus even with estimated gradients.

Figure 6: Results on MNIST.

Finally, for SL, we compare ECE and SCE on MNIST. For each training example $(x, y)$, the training objective is $-\log \pi_\theta(a_y|x)$, where $y(a_y) = 1$. The neural network and dataset settings are the same as above. As shown in Fig. 6(c) and (d), ECE with $p = 2$ is faster than SCE in achieving the same training objective, which also benefits generalization, providing smaller test error than SCE.

7 Conclusion and Future Work

We discovered two phenomena that arise from the use of the standard softmax probability transformation in reinforcement learning and supervised learning, and proposed the escort transform to alleviate or eliminate these disadvantages.
Our findings of the softmax gravity well and softmax damping phenomena challenge the common practice of using the softmax transformation in machine learning. However, there are other factors to consider when assessing such transformations, such as the temperature of the softmax, or how the different transforms might impact generalization in the learned models. An important direction for future work is to investigate whether similar phenomena occur in other scenarios where the softmax is commonly utilized, such as attention models and exponential exploration. Our underlying explanation using the concept of the non-uniform Łojasiewicz (NŁ) coefficient also provides an alternative theory for systematically studying probability transforms, which goes beyond the classic convex matching loss theory [2, 9] and guarantees better optimization results. We expect further use of the NŁ coefficient to be beneficial in other problems in machine learning.

Broader Impact

This research pursues a fundamental and mostly theoretical goal of understanding how a basic component in machine learning, the softmax transformation, impacts the convergence properties of subsequent optimization methods. The implications of this research are very high level and broad, since we investigate a widely used component. It is difficult to identify specific impacts, since this research does not target any specific application area that would directly impact people or society. If forced to make society-level claims, we could attempt to claim that modifying architectures in ways that improve optimization efficiency would have an effect on the overall energy footprint of machine learning technologies, given how much computation is currently being expended on training softmax classifiers and policies.

Acknowledgments and Disclosure of Funding

Jincheng Mei and Lihong Li would like to thank Christoph Dann for providing feedback on a draft of this manuscript. Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii and NSERC. Dale Schuurmans acknowledges funding from Amii and NSERC.

References

[1] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

[2] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems, pages 316-322, 1996.

[3] Christian Beck and Friedrich Schlögl. Thermodynamics of chaotic systems: an introduction. Number 4. Cambridge University Press, 1995.

[4] John S. Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems, 1989.

[5] Minmin Chen, Ramki Gummadi, Chris Harris, and Dale Schuurmans. Surrogate objectives for batch policy optimization in one-step decision making. In Advances in Neural Information Processing Systems, pages 8825-8835, 2019.

[6] Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In International Conference on Learning Representations, 2015.

[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer Series in Statistics, New York, 2001.
[8] Daniel Hennes, Dustin Morrill, Shayegan Omidshafiei, Rémi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, Paavo Parmas, Edgar Duenez-Guzman, et al. Neural replicator dynamics. arXiv preprint arXiv:1906.00190, 2019.

[9] Jyrki Kivinen and Manfred K. Warmuth. Relative loss bounds for multidimensional regression problems. In Advances in Neural Information Processing Systems, pages 287-293, 1998.

[10] Anirban Laha, Saneem A. Chemmengath, Priyanka Agrawal, Mitesh M. Khapra, Karthik Sankaranarayanan, and Harish G. Ramaswamy. On controllable sparse alternatives to softmax. In Advances in Neural Information Processing Systems, pages 6423-6433, 2018.

[11] Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

[12] Jincheng Mei, Chenjun Xiao, Csaba Szepesvári, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. arXiv preprint arXiv:2005.06392, 2020.

[13] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.

[14] Tom Schaul, Diana Borsa, Joseph Modayil, and Razvan Pascanu. Ray interference: a source of plateaus in deep reinforcement learning. arXiv preprint arXiv:1904.11455, 2019.

[15] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

[16] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, second edition, 2018.

[17] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057-1063, 2000.

[18] Constantino Tsallis, Renio S. Mendes, and A. R. Plastino. The role of constraints within generalized nonextensive statistics. Physica A: Statistical Mechanics and its Applications, 261(3-4):534-554, 1998.

[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.