# Provably Efficient Maximum Entropy Exploration

Elad Hazan 1 2, Sham M. Kakade 3 4 2, Karan Singh 1 2, Abby Van Soest 1 2

1 Department of Computer Science, Princeton University. 2 Google AI Princeton. 3 Allen School of Computer Science and Engineering, University of Washington. 4 Department of Statistics, University of Washington. Correspondence to: Elad Hazan, Sham Kakade, Karan Singh, Abby Van Soest.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal; what might we hope that the agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies induced by how the agent behaves. For example, one natural, intrinsically defined, objective is for the agent to learn a policy which induces a distribution over the state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black-box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting where we have sample-based access to the MDP, our proposed algorithm is provably efficient, both in terms of its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which relies on an approximate MDP solver.

1. Introduction

A fundamental problem in reinforcement learning is that of exploring the state space. How do we understand what is even possible in the context of a given environment in the absence of a reward signal? This question has received a lot of attention, with approaches such as learning with intrinsic reward and curiosity-driven methods, surveyed below.

Our work studies a class of objectives[1] that is defined solely as a function of the state-visitation frequencies. A natural such objective is finding a policy that maximizes the entropy of the induced distribution over the state space. More generally, our approach extends to any concave function over distributions.

Suppose the MDP is fully and precisely known, in terms of states, actions, and the entire transition matrix. Then maximizing the entropy can be recast as a convex optimization problem (see Section 3.2.2 or (De Farias & Van Roy, 2003)) over the space of state-visitation frequencies induced by the exhaustive set of all policies. However, most RL instances that are common in practice exhibit at least one of several complications: a prohibitively large state space (e.g., Chess or Go), or an unknown transition matrix (as in common Atari games). These scenarios often require function approximation, i.e., restricting the search to a non-linearly parameterized policy class (e.g., neural networks), which makes the entropy maximization problem non-convex.

As a remedy for the computational difficulty, we propose considering an approximate planning oracle: an efficient method that, given a well-specified reward signal, can find an optimizing policy. Such sample-based planning oracles have been empirically observed to work well with non-linearly parameterized policy classes.
Given such an oracle, we give a provably efficient method for exploration based on the conditional gradient (or Frank-Wolfe) algorithm (Frank & Wolfe, 1956). Formally, we show how to generate a sequence of reward signals that, when sequentially optimized, give rise to a policy whose entropy on the state distribution is close to optimal. Our main theorem gives a bound on the number of calls to the planning oracle which is independent of the size of the state space of the MDP and that of the policy class.

Next, we outline an efficient construction of such oracles and state the resultant sample and computational complexity in the tabular MDP setting. As a proof of concept, we implement the proposed method and demonstrate experiments over several mainstream RL tasks in Section 5.

[1] In contrast to scalar rewards, such objectives permit a broader framework for reward specification, and may be useful in other contexts.

1.1. Informal statement of contributions

To facilitate exploration in potentially unknown MDPs within a restricted policy class Π, we assume access to the environment through the following two oracles:

Approximate planning oracle: Given a reward function (on states) $r : S \to \mathbb{R}$ and a sub-optimality gap ε, the planning oracle returns a policy $\pi = \text{APPROXPLAN}(r, \varepsilon)$ with the guarantee that $V(\pi) \ge \max_{\pi' \in \Pi} V(\pi') - \varepsilon$, where V(π) is the value of policy π.

State distribution estimate oracle: A state distribution oracle estimates the state distribution $\hat d_\pi = \text{DENSITYEST}(\pi, \varepsilon)$ of any given (non-stationary) policy π, guaranteeing that $\|d_\pi - \hat d_\pi\|_\infty \le \varepsilon$.

Given access to these two oracles, we describe a method that provably optimizes any continuous and smooth objective over the state-visitation frequencies. Of special interest are the maximum entropy and relative entropy objectives.

Theorem 1.1 (Main Theorem, Informal). There exists an efficient algorithm (Algorithm 1) such that for any β-smooth measure R and any ε > 0, in $O(1/\varepsilon)$ calls to APPROXPLAN and DENSITYEST, it returns a policy $\bar\pi$ with
$$R(d_{\bar\pi}) \ge \max_{\pi \in \Pi} R(d_\pi) - \varepsilon.$$
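To make the assumed interfaces concrete, the following is a minimal Python sketch of the two oracle signatures; the names (`approx_plan`, `density_est`, `Policy`) are illustrative placeholders and are not identifiers from the paper or its released code.

```python
from typing import Callable
import numpy as np

# Schematic interfaces for the two oracles; all names are illustrative
# placeholders, not part of the paper's implementation.

Policy = Callable[[int], int]      # maps a state index to an action (randomized in general)
Reward = Callable[[int], float]    # a reward defined on states, r : S -> R


def approx_plan(r: Reward, eps: float) -> Policy:
    """APPROXPLAN(r, eps): return a policy whose value under r is within eps
    of the best value achievable inside the policy class Pi."""
    raise NotImplementedError  # supplied by an RL planner (value iteration, SAC, ...)


def density_est(pi: Policy, eps: float) -> np.ndarray:
    """DENSITYEST(pi, eps): return an estimate of the discounted state
    distribution d_pi of pi, accurate to eps in the sup norm."""
    raise NotImplementedError  # supplied by rollouts and counting, a KDE, etc.
```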
1.2. Related work

We review related works in this section.

Reward Shaping & Imitation Learning: Direct optimization approaches to RL (such as policy gradient methods) tend to perform favorably when random sequences of actions lead the agent to some positive reward, but tend to fail when the rewards are sparse or myopic. Thus far, the most practical approaches to address this have either been through carefully constructed reward shaping (e.g. (Ng et al., 1999), where dense reward functions are provided to make the optimization problem more tractable) or through imitation learning (Abbeel & Ng, 2004; Ross et al., 2011), where an expert demonstrates to the agent how to act.

PAC-RL Learning: For the case of tabular Markov decision processes, the balance of exploration and exploitation has been addressed by a number of methods which utilize confidence-based reward bonuses to encourage exploration in order to ultimately behave near-optimally (Kearns & Singh, 2002; Kakade, 2003; Strehl et al., 2006; Lattimore & Hutter, 2014; Dann & Brunskill, 2015; Szita & Szepesvári, 2010; Azar et al., 2017). (Lim & Auer, 2012) offer a Dijkstra-like algorithm for discovering incrementally reachable states in the tabular setting.

Count-based Models & Directed Exploration: There is a host of recent empirical successes using deep RL methods which encourage exploration in some form (Mnih et al., 2015; Silver et al., 2016). The approaches are based on a few related ideas: encouraging exploration through state-visitation frequencies (e.g. (Ostrovski et al., 2017; Bellemare et al., 2016; Tang et al., 2017)); intrinsic reward signals derived from novelty or prediction error (Lopes et al., 2012; Pathak et al., 2017; Savinov et al., 2018; Fu et al., 2017; Mohamed & Jimenez Rezende, 2015; Houthooft et al., 2016; Weber et al., 2017); aligning an intrinsic reward to the target objective (Kaelbling, 1993; Chentanez et al., 2005; Singh et al., 2010; 2009; Zheng et al., 2018); or sample-based approaches to tracking value-function uncertainty (Osband et al., 2016; 2018).

Intrinsic Learning: Works in (Chentanez et al., 2005; Singh et al., 2009; 2010) established computational theories of intrinsic reward signals (and how they might help with downstream learning of tasks), and other works showed how to incorporate intrinsic rewards in the absence of any true reward signal (Warde-Farley et al., 2018; Burda et al., 2018b;a; Nair et al., 2018). The potential benefit is that such learning may help the agent reach a variety of achievable goals and do well on other extrinsically defined tasks, not just the task it was explicitly trained on under one specific reward function (e.g. see (Chentanez et al., 2005; Singh et al., 2009; Warde-Farley et al., 2018; Nair et al., 2018)).

2. Preliminaries

Markov decision process: An infinite-horizon discounted Markov Decision Process is a tuple M = (S, A, r, P, γ, d₀), where S is the set of states, A is the set of actions, and d₀ is the distribution of the initial state s₀. At each timestep t, upon observing the state s_t, the execution of action a_t triggers an observable reward r_t = r(s_t, a_t) and a transition to a new state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The performance on an infinite sequence of states and actions (hereafter referred to as a trajectory) is judged through the (discounted) cumulative reward it accumulates, defined as
$$V(\tau = (s_0, a_0, s_1, a_1, \ldots)) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t).$$

Policies: A policy is a (randomized) mapping from a history, say (s₀, a₀, r₀, s₁, a₁, r₁, …, s_{t-1}, a_{t-1}, r_{t-1}), to an action a_t. A stationary policy π is a (randomized) function which maps a state to an action in a time-independent manner, i.e. $\pi : S \to \Delta(A)$. When a policy π is executed on some MDP M, it produces a distribution over infinite-length trajectories τ = (s₀, a₀, s₁, a₁, …), specified as
$$P(\tau \mid \pi) = d_0(s_0) \prod_{i=0}^{\infty} \pi(a_i \mid s_i)\, P(s_{i+1} \mid s_i, a_i).$$
The (discounted) value V_π of a policy π is the expected cumulative reward an action sequence sampled from the policy π gathers:
$$V_\pi = \mathbb{E}_{\tau \sim P(\cdot \mid \pi)}\big[V(\tau)\big] = (1-\gamma)\, \mathbb{E}_{\tau \sim P(\cdot \mid \pi)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right].$$

Induced state distributions: The t-step state distributions and the (discounted) state distributions induced by a policy π are
$$d_{t,\pi}(s) = P(s_t = s \mid \pi) = \sum_{\tau:\, s_t = s} P(\tau \mid \pi), \qquad (2.1)$$
$$d_{t,\pi}(s, a) = P(s_t = s, a_t = a \mid \pi) = \sum_{\tau:\, s_t = s,\, a_t = a} P(\tau \mid \pi), \qquad (2.2)$$
$$d_\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t d_{t,\pi}(s), \qquad (2.3)$$
$$d_\pi(s, a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t d_{t,\pi}(s, a). \qquad (2.4)$$
The latter distributions can be viewed as the analogue of the stationary distribution in the infinite-horizon setting.
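As a small worked illustration of definition (2.3) (an example of ours, not taken from the paper), consider two states $s_0, s_1$ with a single action that deterministically moves the agent to the other state, and $d_0(s_0) = 1$. Then $d_{t,\pi}(s_0) = 1$ for even $t$ and $0$ for odd $t$, so
$$d_\pi(s_0) = (1-\gamma)\sum_{t \ \mathrm{even}} \gamma^t = \frac{1-\gamma}{1-\gamma^2} = \frac{1}{1+\gamma}, \qquad d_\pi(s_1) = \frac{\gamma}{1+\gamma},$$
which approaches the uniform distribution $(\tfrac12, \tfrac12)$ as $\gamma \to 1$.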
Mixtures of stationary policies: Given a sequence of k policies C = (π₀, …, π_{k−1}) and α ∈ Δ_k (the simplex), we define π_mix = (α, C) to be a mixture over stationary policies. The (non-stationary) policy π_mix is one where, at the first timestep t = 0, we sample policy π_i with probability α_i and then use this policy for all subsequent timesteps. In particular, the behavior of a mixture π_mix with respect to an MDP is that it induces infinite-length trajectories τ = (s₀, a₀, s₁, a₁, …) with the probability law
$$P(\tau \mid \pi_{mix}) = \sum_{i=0}^{k-1} \alpha_i\, P(\tau \mid \pi_i), \qquad (2.5)$$
and the induced state distribution is
$$d_{\pi_{mix}}(s) = \sum_{i=0}^{k-1} \alpha_i\, d_{\pi_i}(s). \qquad (2.6)$$
Note that such a distribution over policies need not be representable as a stationary stochastic policy (even if the π_i's are stationary), because the sampled actions are no longer conditionally independent given the states.

3. The Objective: Max-Ent Exploration

As each policy induces a distribution over states, we can associate a concave reward functional R(·) with this induced distribution. We say that a policy π⋆ is a maximum-entropy exploration policy, also referred to as the max-ent policy, if the corresponding induced state distribution attains the maximum possible R(d_π) among the policy class Π; that is,
$$\pi^\star \in \arg\max_{\pi \in \Pi} R(d_\pi).$$
When considering the class of all policies, Lemma 3.3 assures us that the search over the class of stationary policies is sufficient. Our goal is to find a policy that induces a state distribution with a comparable value of the reward functional.

3.1. Examples of reward functionals

A possible quantity of interest that serves as a motivation for considering such functionals is the entropy of the induced distribution[2]:
$$\max_{\pi \in \Pi} \Big\{ H(d_\pi) = -\mathbb{E}_{s \sim d_\pi}\big[\log d_\pi(s)\big] \Big\}.$$
The same techniques we derive can also be used to optimize other entropic measures. For example, we may be interested in minimizing
$$\mathrm{KL}(d_\pi \,\|\, Q) = \mathbb{E}_{s \sim d_\pi}\left[\log \frac{d_\pi(s)}{Q(s)}\right]$$
for some given distribution Q(s). Alternatively, we may seek to minimize a cross-entropy measure
$$\mathbb{E}_{s \sim Q}\left[\log \frac{1}{d_\pi(s)}\right] = \mathrm{KL}(Q \,\|\, d_\pi) + H(Q),$$
where the expectation is now under Q. For uniform Q, this latter measure may be more aggressive in forcing π to have more uniform coverage than the entropy objective.

[2] Please note the distinction from the conditional entropy of actions given the state, e.g. (Todorov, 2007; Haarnoja et al., 2018).

3.2. Landscape of the objective function

In this section, we establish that the entropy of the state distribution is not a concave function of the policy. Similar constructions can establish analogous statements for other non-trivial functionals. Subsequently, when the policy class under consideration is exhaustive, we discuss a possible convex reformulation of the objective in the space of induced distributions, which constitute a convex set.

3.2.1. Non-convexity in the policy space

Despite the concavity of the entropy function, our overall maximization problem is not concave, as the state distribution is not an affine function of the policy. This is stated precisely in the following lemma.

Lemma 3.1. H(d_π) is not concave in π.

Proof. Figure 1 demonstrates the behavior of π₀, π₁, π₂ on a 6-state MDP with binary actions.

[Figure 1: Description of π₀, π₁, π₂; panels (a)–(c) depict the induced second-step state distributions d_{2,π₀}, d_{2,π₁}, d_{2,π₂} over the states s_{2,0}, s_{2,1}, s_{2,2}.]

Note that for sufficiently large γ → 1 and any policy π, the discounted state distribution converges to the distribution on the states at the second timestep, or formally d_π → d_{2,π}. Now, with the realization π₀ = (π₁ + π₂)/2, observe that d_{2,π₀} is not uniform on {s_{2,0}, s_{2,1}, s_{2,2}}, while d_{2,π₁} and d_{2,π₂} both are, implying that
$$H(d_{2,\pi_0}) < \frac{H(d_{2,\pi_1}) + H(d_{2,\pi_2})}{2}.$$

Lemma 3.2. For any policy π and MDP M, define the matrix $P_\pi \in \mathbb{R}^{|S| \times |S|}$ so that
$$P_\pi(s', s) = \sum_{a \in A} \pi(a \mid s)\, P(s' \mid s, a).$$
Then it is true that
1. P_π is linear in π,
2. $d_{t,\pi} = P_\pi^t\, d_0$ for all t ≥ 0,
3. $d_\pi = (1-\gamma)(I - \gamma P_\pi)^{-1} d_0$.

Proof. Linearity of P_π is evident from the definition. Claims (2) and (3) may be verified by direct calculation.
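The closed form in item 3 is easy to exercise numerically. Below is a minimal NumPy sketch (illustrative, not the authors' code) that builds $P_\pi$ for a tabular MDP given as an array `P[s, a, s']` and a stochastic policy `pi[s, a]`, and checks the closed form against the truncated discounted sum from (2.3):

```python
import numpy as np

def policy_transition_matrix(P: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """P has shape (S, A, S) with P[s, a, s'] = P(s'|s, a); pi has shape (S, A).
    Returns P_pi with P_pi[s', s] = sum_a pi[a|s] P(s'|s, a), as in Lemma 3.2."""
    return np.einsum("sa,sat->ts", pi, P)   # note the transpose: row s', column s

def discounted_state_distribution(P: np.ndarray, pi: np.ndarray,
                                  d0: np.ndarray, gamma: float) -> np.ndarray:
    """d_pi = (1 - gamma) (I - gamma P_pi)^{-1} d0  (Lemma 3.2, item 3)."""
    P_pi = policy_transition_matrix(P, pi)
    S = P_pi.shape[0]
    return (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi, d0)

# Sanity check against d_pi = (1-gamma) * sum_t gamma^t d_{t,pi}, truncated at a
# large horizon, on an illustrative random MDP (not an example from the paper).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A));   pi /= pi.sum(axis=1, keepdims=True)
d0 = np.ones(S) / S

d_exact = discounted_state_distribution(P, pi, d0, gamma)
P_pi = policy_transition_matrix(P, pi)
d_t, d_approx = d0.copy(), np.zeros(S)
for t in range(1000):
    d_approx += (1 - gamma) * gamma**t * d_t   # accumulate (1-gamma) gamma^t d_{t,pi}
    d_t = P_pi @ d_t                           # d_{t+1,pi} = P_pi d_{t,pi}
assert np.allclose(d_exact, d_approx, atol=1e-8)
```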
3.2.2. Convexity in the distribution space

Define the set of all induced distributions as
$$\mathcal{K} = \Big\{ d : d(s,a) \ge 0 \ \text{ and } \ \sum_a d(s, a) = (1-\gamma)\, d_0(s) + \gamma \sum_{s', a'} P(s \mid s', a')\, d(s', a') \ \text{ for all } s \in S \Big\}.$$
For every d ∈ K, it is possible to construct a policy π with d_π = d, and for every π, d_π ∈ K holds (Puterman, 2014). For an exhaustive policy class, the search for a max-ent policy can therefore be recast as a convex optimization problem over the space of distributions:
$$\max_{d \in \mathcal{K}} R(d).$$
Although the above reduction is outlined for an exhaustive policy class, similar reductions are possible for linearly-parameterized policy classes (Peters et al., 2010; Neu et al., 2017). These techniques can be extended to the case of MDPs with unknown dynamics (De Farias & Van Roy, 2003).

3.2.3. Sufficiency of stationary policies

The set of non-Markovian policies is richer than the set of Markov stationary policies in terms of the distributions over trajectories each may induce. A priori, it is not evident that maximizing R(d_π) over the set of stationary policies is sufficient to guarantee optimality within the larger class of all policies. Lemma 3.3 establishes this claim by equating the sets of achievable induced state distributions for these two classes of policies.

Lemma 3.3. (Puterman, 2014) For any, possibly non-Markovian, policy π, define a stationary Markov policy π' as $\pi'(a \mid s) = \frac{d_\pi(s,a)}{d_\pi(s)}$. Then $d_\pi = d_{\pi'}$.

4. Algorithms & Main Results

The algorithm maintains a distribution over policies, and proceeds by adding a new policy to the support of the mixture and reweighing the components. To describe the algorithm, we will utilize access to two kinds of oracles. The constructions for these are detailed in later sections.

Approximate planning oracle: Given a reward function (on states) $r : S \to \mathbb{R}$ and a sub-optimality gap ε₁, the planning oracle returns a policy[3] $\pi = \text{APPROXPLAN}(r, \varepsilon_1)$ with the guarantee that $V_\pi \ge \max_{\pi' \in \Pi} V_{\pi'} - \varepsilon_1$.

State distribution estimate oracle: A state distribution oracle estimates the state distribution $\hat d_\pi = \text{DENSITYEST}(\pi, \varepsilon_0)$ of any given (non-stationary) policy π, guaranteeing that $\|d_\pi - \hat d_\pi\|_\infty \le \varepsilon_0$.

[3] As the oracle is solving a discounted problem, we know the optimal value is achieved by a stationary policy.

Algorithm 1: Maximum-entropy policy computation.
1: Input: step size η, number of iterations T, planning oracle error tolerance ε₁ > 0, state distribution oracle error tolerance ε₀ > 0, reward functional R.
2: Set C₀ = {π₀}, where π₀ is an arbitrary policy.
3: Set α₀ = 1.
4: for t = 0, …, T − 1 do
5:   Call the state distribution oracle on π_mix,t = (α_t, C_t): $\hat d_{\pi_{mix,t}} = \text{DENSITYEST}(\pi_{mix,t}, \varepsilon_0)$.
6:   Define the reward function r_t as $r_t(s) = \nabla R(\hat d_{\pi_{mix,t}})(s) := \frac{dR(X)}{dX(s)}\Big|_{X = \hat d_{\pi_{mix,t}}}$.
7:   Compute the (approximately) optimal policy on r_t: $\pi_{t+1} = \text{APPROXPLAN}(r_t, \varepsilon_1)$.
8:   Update π_mix,t+1 = (α_{t+1}, C_{t+1}) to be
     $C_{t+1} = (\pi_0, \ldots, \pi_t, \pi_{t+1})$,  (4.1)
     $\alpha_{t+1} = ((1-\eta)\,\alpha_t,\ \eta)$.  (4.2)
9: end for
10: return π_mix,T = (α_T, C_T).
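The following is a minimal Python sketch of the loop above, with the two oracles and the gradient of the reward functional supplied as callables; all names and signatures are illustrative stand-ins rather than the authors' implementation. The guarantees for suitable parameter choices are quantified in Theorem 4.1 below.

```python
import numpy as np

def max_ent_policy_computation(approx_plan, density_est, grad_R,
                               pi0, eta, T, eps0, eps1):
    """Illustrative sketch of Algorithm 1 (not the authors' implementation).
    approx_plan(r, eps1)                -> policy optimizing the state-reward vector r (APPROXPLAN)
    density_est(policies, alphas, eps0) -> estimated state distribution of the mixture (DENSITYEST)
    grad_R(d_hat)                       -> gradient of the reward functional R at d_hat."""
    policies = [pi0]               # C_t, the support of the mixture
    alphas = np.array([1.0])       # alpha_t, the mixture weights
    for _ in range(T):
        d_hat = density_est(policies, alphas, eps0)     # step 5
        r_t = grad_R(d_hat)                             # step 6
        pi_next = approx_plan(r_t, eps1)                # step 7
        policies.append(pi_next)                        # step 8: C_{t+1}, alpha_{t+1}
        alphas = np.append((1.0 - eta) * alphas, eta)   # weights stay normalized
    return policies, alphas
```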
We shall assume in the following discussion that the reward functional R is β-smooth and B-bounded, in the sense that it satisfies, for all X, Y,
$$\|\nabla R(X) - \nabla R(Y)\|_\infty \le \beta\, \|X - Y\|_\infty, \qquad (4.3)$$
$$-\beta I \preceq \nabla^2 R(X) \preceq \beta I, \qquad \|\nabla R(X)\|_\infty \le B. \qquad (4.4)$$

Theorem 4.1 (Main Theorem). For any ε > 0, set ε₁ = 0.1ε, ε₀ = 0.1β⁻¹ε, and η = 0.1β⁻¹ε. When Algorithm 1 is run for T iterations, where
$$T \ge 10\,\beta\,\varepsilon^{-1} \log\big(10 B \varepsilon^{-1}\big),$$
we have that
$$R(d_{\pi_{mix,T}}) \ge \max_{\pi \in \Pi} R(d_\pi) - \varepsilon.$$

Before we begin the proof, we state the implication for maximizing the entropy of the induced distribution. While the entropy objective is, strictly speaking, not smooth, one may consider a smoothed alternative H_σ defined below:
$$H_\sigma(d_\pi) = -\mathbb{E}_{s \sim d_\pi}\big[\log(d_\pi(s) + \sigma)\big].$$
When the algorithm is fed H_σ as the proxy reward functional, it is possible to make sub-optimality guarantees on the true objective H. Lemma A.1(D) relates the entropy functional H to its smoothed variant H_σ, while the rest of the lemma quantifies the smoothness of H_σ. The factors of |S| incurred below are a consequence of the smoothing imposed on H, and are not necessary for naturally smooth objectives.

Corollary 4.2. For any ε > 0, set $\sigma = \frac{0.1\varepsilon}{2|S|}$, $\varepsilon_1 = 0.1\varepsilon$, $\varepsilon_0 = \frac{0.1\varepsilon^2}{80|S|}$, and $\eta = \frac{0.1\varepsilon^2}{40|S|}$. When Algorithm 1 is run for T iterations with the reward functional H_σ, where
$$T \ge \frac{40|S|}{0.1\varepsilon^2} \log\frac{10\log|S|}{\varepsilon},$$
we have that
$$H(d_{\pi_{mix,T}}) \ge \max_{\pi \in \Pi} H(d_\pi) - \varepsilon.$$

We continue with the proof of the main theorem.

Proof of Theorem 4.1. Let π⋆ be a maximum-entropy policy, i.e. $\pi^\star \in \arg\max_{\pi \in \Pi} R(d_\pi)$. Then
$$R(d_{\pi_{mix,t+1}}) = R\big((1-\eta)\, d_{\pi_{mix,t}} + \eta\, d_{\pi_{t+1}}\big) \qquad \text{(Equation 2.6)}$$
$$\ge R(d_{\pi_{mix,t}}) + \eta \big\langle d_{\pi_{t+1}} - d_{\pi_{mix,t}},\ \nabla R(d_{\pi_{mix,t}}) \big\rangle - \eta^2 \beta\, \big\| d_{\pi_{t+1}} - d_{\pi_{mix,t}} \big\|^2. \qquad \text{(smoothness)}$$
The second inequality follows from the smoothness[4] of R. To incorporate the error due to the two oracles, observe that
$$\big\langle d_{\pi_{t+1}},\ \nabla R(d_{\pi_{mix,t}}) \big\rangle \ge \big\langle d_{\pi_{t+1}},\ \nabla R(\hat d_{\pi_{mix,t}}) \big\rangle - \beta\, \big\| d_{\pi_{mix,t}} - \hat d_{\pi_{mix,t}} \big\| \ge \big\langle d_{\pi^\star},\ \nabla R(\hat d_{\pi_{mix,t}}) \big\rangle - \beta\varepsilon_0 - \varepsilon_1 \ge \big\langle d_{\pi^\star},\ \nabla R(d_{\pi_{mix,t}}) \big\rangle - 2\beta\varepsilon_0 - \varepsilon_1.$$
The first and last inequalities invoke the assumptions laid out in Equation 4.3. The second inequality above follows from the defining character of the planning oracle, i.e., with respect to the reward vector $r_t = \nabla R(\hat d_{\pi_{mix,t}})$, for any policy π ∈ Π it holds true that
$$V_{\pi_{t+1}} = \langle d_{\pi_{t+1}}, r_t \rangle \ge V_\pi - \varepsilon_1 = \langle d_\pi, r_t \rangle - \varepsilon_1.$$
In particular, this statement holds[5] for the choice π = π⋆. Using the above fact and continuing,
$$R(d_{\pi_{mix,t+1}}) \ge R(d_{\pi_{mix,t}}) + \eta \big\langle d_{\pi^\star} - d_{\pi_{mix,t}},\ \nabla R(d_{\pi_{mix,t}}) \big\rangle - 2\eta\beta\varepsilon_0 - \eta\varepsilon_1 - \eta^2\beta \ge (1-\eta)\, R(d_{\pi_{mix,t}}) + \eta\, R(d_{\pi^\star}) - 2\eta\beta\varepsilon_0 - \eta\varepsilon_1 - \eta^2\beta.$$
The last step here utilizes the concavity of R; indeed, the inequality follows immediately from the sub-gradient characterization of concave functions. Now, with the aid of the above, we observe the following inequality:
$$R(d_{\pi^\star}) - R(d_{\pi_{mix,t+1}}) \le (1-\eta)\big( R(d_{\pi^\star}) - R(d_{\pi_{mix,t}}) \big) + 2\eta\beta\varepsilon_0 + \eta\varepsilon_1 + \eta^2\beta.$$
Telescoping the inequality, this simplifies to
$$R(d_{\pi^\star}) - R(d_{\pi_{mix,T}}) \le (1-\eta)^T \big( R(d_{\pi^\star}) - R(d_{\pi_{mix,0}}) \big) + 2\beta\varepsilon_0 + \varepsilon_1 + \eta\beta \le B\, e^{-T\eta} + 2\beta\varepsilon_0 + \varepsilon_1 + \eta\beta.$$
Setting ε₁ = 0.1ε, ε₀ = 0.1β⁻¹ε, η = 0.1β⁻¹ε, and T = η⁻¹ log(10Bε⁻¹) suffices.

[4] See Section 2.1 in (Bubeck et al., 2015) for equivalent definitions of smoothness in terms of the function value and the Hessian.
[5] Even when Π is chosen to be the set of all stationary policies, this argument does not rely on π⋆ being a stationary policy, since π_{t+1} is an optimal policy for the reward function r_t among the class of all policies.
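For concreteness, when the proxy reward functional is the smoothed entropy H_σ of Corollary 4.2, the reward vector computed in step 6 of Algorithm 1 is the gradient of H_σ at the estimated state distribution. A minimal sketch (illustrative, not the authors' code):

```python
import numpy as np

def grad_smoothed_entropy(d_hat: np.ndarray, sigma: float) -> np.ndarray:
    """Gradient of H_sigma(d) = -sum_s d(s) log(d(s) + sigma) with respect to d;
    usable as the reward vector r_t in step 6 of Algorithm 1 when R = H_sigma."""
    return -np.log(d_hat + sigma) - d_hat / (d_hat + sigma)
```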
4.1. Tabular setting

In general, the construction of provably computationally efficient approximate planning oracles for MDPs with large or continuous state spaces poses a challenge. Setting aside limited settings (e.g. the Linear Quadratic Regulator (Bertsekas, 2005; Fazel et al., 2018)), one may only appeal to the recent empirical successes of sample-based planning algorithms that rely on the power of non-linear function approximation. Nevertheless, one may expect, and possibly require, that any solution proposed to address the general case performs reasonably when restricted to the tabular setting. In this spirit, we outline the construction of the required oracles in the tabular setting, where we consider the exhaustive class of policies.

4.1.1. The known MDP case

With the knowledge of the transition matrix P of an MDP M in the form of an explicit tensor, the planning oracle can be implemented via any of the exact solution methods (Bertsekas, 2005), e.g. value iteration or linear programming. The state distribution oracle can be efficiently implemented as Lemma 3.2 suggests.

Corollary 4.3. When the MDP M is known explicitly, with the oracles described in Section 4, Algorithm 1 runs in poly(β, |S|, |A|, 1/(1−γ), 1/ε, log B) time to guarantee
$$R(d_{\pi_{mix,T}}) \ge \max_\pi R(d_\pi) - \varepsilon.$$

4.1.2. The unknown MDP case

For the case of an unknown MDP, a sample-based algorithm must iteratively try to learn about the MDP through its interactions with the environment. Here, we assume a γ-discounted episodic setting, where the agent can act in the environment starting from s₀ ∼ d₀ for some number of steps, and is then able to reset. Our measure of sample complexity in this setting is the number of $O\big((1-\gamma)^{-1}\big)$-length episodes the agent must sample to achieve an ε-suboptimal performance guarantee.

The algorithm outlined below makes a distinction between the set of states it is (relatively) sure about and the set of states that have not yet been visited enough times. The algorithm and the analysis are similar to the E3 algorithm (Kearns & Singh, 2002). Since algorithms like E3 proceed by building a relatively accurate model on the set of reachable states, as opposed to estimates of the value functions, this permits the reuse of information across different invocations, each of which might operate on a different reward signal.

Theorem 4.4. For an unknown MDP, with Algorithm 2 as the planning oracle and Algorithm 3 as the distribution estimate oracle, Algorithm 1 runs in poly(β, |S|, |A|, 1/(1−γ), 1/ε) time and executes
$$O\left( \frac{B^3 |S|^2 |A|}{\varepsilon^3 (1-\gamma)^2} + \frac{\beta^3}{\varepsilon^3} \right)$$
episodes of length $O\left( \frac{\log(|S|\varepsilon^{-1})}{\log \gamma^{-1}} \right)$ to guarantee that
$$R(d_{\pi_{mix,T}}) \ge \max_\pi R(d_\pi) - \varepsilon.$$

A sub-optimality bound may be derived for the non-smooth entropy functional H via Lemma A.1. Again, the extraneous factors introduced in the process are a consequence of the smoothing imposed via H_σ.

Corollary 4.5. For an unknown MDP, with Algorithm 2 as the planning oracle, Algorithm 3 as the distribution estimate oracle, and H_σ as the proxy reward functional, Algorithm 1 runs in poly(|S|, |A|, 1/(1−γ), 1/ε) time and executes
$$O\left( \frac{|S|^2 |A|}{\varepsilon^3 (1-\gamma)^2} + \frac{|S|^3}{\varepsilon^6} \right)$$
episodes of length $O\left( \frac{\log(|S|\varepsilon^{-1})}{\log \gamma^{-1}} \right)$ to guarantee that
$$H(d_{\pi_{mix,T}}) \ge \max_\pi H(d_\pi) - \varepsilon.$$

Before we state the proof, we note the following two lemmas. The first is an adaptation of the analysis of the E3 algorithm; the second is standard, and we include it only for completeness. The proofs of these may be found in the appendix.

Lemma 4.6. For any reward function r with $\|r\|_\infty \le B$ and any ε, δ > 0, with
$$\varepsilon_1 = 0.1 B^{-1} \varepsilon, \quad m = \frac{32 B^2 |S| \log\frac{2|S|}{\delta}}{(1-\gamma)^2 (0.1\varepsilon)^2}, \quad n = \frac{B}{0.1\varepsilon} \log \frac{32 |S|^2 |A| \log\frac{2|S|}{\delta}}{(1-\gamma)^2 (0.1\varepsilon)^2\, \delta}, \quad t_0 = \frac{\log\frac{0.1\varepsilon}{|S|}}{\log\gamma},$$
Algorithm 2 guarantees, with probability at least 1 − δ,
$$V_\pi \ge \max_{\pi'} V_{\pi'} - \varepsilon.$$
Furthermore, note that if Algorithm 2 is invoked T times (on possibly different reward functions), the total number of episodes sampled across all the invocations is
$$n\,(T + m|S||A|) = O\left( \frac{BT}{\varepsilon} + \frac{B^3 |S|^2 |A|}{\varepsilon^3 (1-\gamma)^2} \right),$$
each episode being of length t₀.

Algorithm 2: Sample-based planning for an unknown MDP.
1: Input: reward r, error tolerance ε > 0, exact planning oracle tolerance ε₁ > 0, oversampling parameter m, number of rollouts n, rollout length t₀.
2: Initialize a persistent data structure $C \in \mathbb{R}^{|S|^2 \times |A|}$, which is maintained across different calls to the planning algorithm to keep transition counts, with C(s'|s, a) = 0 for every (s', s, a) ∈ S² × A.
3: repeat
4:   Declare the set of known states $K = \{ s : \min_{a \in A} \sum_{s' \in S} C(s'|s, a) \ge m \}$, and set
     $\hat P(s'|s, a) = \frac{C(s'|s,a)}{\sum_{s'' \in S} C(s''|s,a)}$ if s ∈ K, and $\hat P(s'|s, a) = \mathbb{1}_{s'=s}$ otherwise.
5:   Define the reward function $r_K(s) = r(s)$ if s ∈ K, and $r_K(s) = B$ otherwise.
6:   Compute an (approximately) optimal policy π on the MDP induced by $\hat P$ and reward r_K. This task is purely computational, and can be done as indicated in Section 4.1.1. Also, modify the policy so that on every state s ∈ S \ K it chooses the least-performed action:
     $\pi(s) = (\Pi^*(r_K, \varepsilon_1))(s)$ if s ∈ K, and $\pi(s) = \arg\min_{a \in A} \sum_{s' \in S} C(s'|s, a)$ otherwise.
7:   Run π on the true MDP M to obtain n independently sampled t₀-length trajectories (τ₁, …, τ_n), and increment the corresponding counts in C(s'|s, a).
8:   If and only if no trajectory τ_i contains a state s ∈ S \ K, mark π as stable.
9: until π is stable.
10: return π.
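A compact sketch in the spirit of Algorithm 2 is given below, assuming sample access to the environment through hypothetical `reset()` and `sample_step(s, a)` functions; the value-iteration subroutine also serves as the exact planner needed in the known-MDP case of Section 4.1.1. The constants `m`, `n`, `t0` are placeholders rather than the values prescribed by Lemma 4.6, and the code is illustrative, not the authors' implementation.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Exact planning for a known tabular MDP (the setting of Section 4.1.1).
    P: (S, A, S) transition probabilities, r: (S,) state rewards.
    Returns a deterministic policy, one action index per state."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r[:, None] + gamma * (P @ V)            # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

def sample_based_plan(r, B, gamma, reset, sample_step, counts, m=50, n=20, t0=200):
    """E3-style planner in the spirit of Algorithm 2. `reset()` draws s0 ~ d0 and
    `sample_step(s, a)` draws s' ~ P(.|s, a); both are hypothetical stand-ins for
    environment access. `counts` is the persistent tensor C[s', s, a] of transition
    counts, created once as np.zeros((S, S, A)) and shared across invocations."""
    S, A = counts.shape[1], counts.shape[2]
    while True:
        visits = counts.sum(axis=0)                 # visits[s, a]
        known = visits.min(axis=1) >= m             # the set K of "known" states
        P_hat = np.zeros((S, A, S))                 # empirical model; self-loops outside K
        for s in range(S):
            for a in range(A):
                if known[s]:
                    P_hat[s, a] = counts[:, s, a] / visits[s, a]
                else:
                    P_hat[s, a, s] = 1.0
        r_K = np.where(known, r, B)                 # optimistic reward B outside K
        pi = value_iteration(P_hat, r_K, gamma)
        pi = np.where(known, pi, visits.argmin(axis=1))  # least-tried action outside K
        escaped = False                             # did any rollout leave K?
        for _ in range(n):
            s = reset()
            escaped = escaped or not known[s]
            for _ in range(t0):
                a = int(pi[s])
                s_next = sample_step(s, a)
                counts[s_next, s, a] += 1
                escaped = escaped or not known[s_next]
                s = s_next
        if not escaped:                             # the policy is "stable"
            return pi
```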
Algorithm 3: Sample-based estimate of the state distribution.
1: Input: a policy π, termination length t₀, oversampling parameter m.
2: Sample m trajectories (τ₀, …, τ_{m−1}) of length t₀ following the policy π.
3: For every t < t₀, calculate the empirical state distribution
$$\hat d_{t,\pi}(s) = \frac{\big| \{ i < m : \tau_i = (s_0, a_0, \ldots) \text{ with } s_t = s \} \big|}{m}.$$
4: return $\hat d_\pi = \frac{1-\gamma}{1-\gamma^{t_0}} \sum_{t=0}^{t_0-1} \gamma^t\, \hat d_{t,\pi}$.

Lemma 4.7. For any ε₀, δ > 0, when Algorithm 3 is run with
$$m = \frac{200}{\varepsilon_0^2} \log\frac{2|S|\, t_0}{\delta}, \qquad t_0 = \frac{\log(0.1\varepsilon_0)}{\log\gamma},$$
the estimate $\hat d_\pi$ satisfies $\|\hat d_\pi - d_\pi\|_\infty \le \varepsilon_0$ with probability at least 1 − δ. In this process, the algorithm samples m episodes of length t₀.

Proof of Theorem 4.4. The claim follows immediately from the invocations of the two lemmas above with the parameter settings proposed in Theorem 4.1.
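For concreteness, a minimal sketch of the density-estimate oracle of Algorithm 3, under the same hypothetical environment-access functions as above (illustrative, not the authors' implementation):

```python
import numpy as np

def density_estimate(pi, reset, sample_step, S, gamma, m=1000, t0=200):
    """Sketch of Algorithm 3: estimate the discounted state distribution d_pi by
    rolling out m trajectories of length t0 and averaging the empirical
    per-timestep distributions with weights proportional to gamma^t."""
    d_t = np.zeros((t0, S))                      # empirical d_{t, pi}
    for _ in range(m):
        s = reset()
        for t in range(t0):
            d_t[t, s] += 1.0 / m
            s = sample_step(s, int(pi[s]))
    weights = (1 - gamma) / (1 - gamma**t0) * gamma ** np.arange(t0)
    return weights @ d_t                         # hat d_pi, a vector over states
```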
5. Proof of Concept Experiments

We report the results from a preliminary set of experiments[6]. In each case, the MaxEnt agent learns to access the set of reachable states within a small number of iterations, while monotonically increasing the entropy of the induced state distribution. Recall that Algorithm 1 requires access to an approximate planning oracle and a density estimator for the induced distribution. In most of the experimental environments, the density estimator is deliberately chosen to be simple: a count-based estimate over the discretized state space. It is possible to use neural density estimators and other function-approximation based estimators in its stead, as we do for the Humanoid environment.

[6] The open-source implementations may be found at https://github.com/abbyvansoest/maxent.

5.1. Environments and density estimation

Pendulum. The 2-dimensional state space for Pendulum (from (Brockman et al., 2016)) was discretized evenly into a grid of dimension 8 × 8. The dimensions represent angular position and angular velocity. For Pendulum, the maximum torque and velocity were capped at 1.0 and 7.0, respectively.

Ant. The 29-dimensional state space for Ant (with a MuJoCo engine) was first reduced to dimension 7, combining the agent's x and y location in the gridspace with a 5-dimensional random projection of the remaining 27 states. The x and y dimensions were discretized into 16 bins in the range [−12, 12]. The other dimensions were each normalized and discretized into 15 bins in the range [−1, 1]. While the planning agent had access to the full state representation, the density estimation was performed exclusively on the reduced representation.

Humanoid. The 376-dimensional state space for the MuJoCo Humanoid environment was too large and complex to effectively discretize without significantly affecting the accuracy of estimation. In view of this, the agent's state-space density was estimated using kernel density estimation (from scikit-learn (Pedregosa et al., 2011)) with an Epanechnikov kernel and a bandwidth of 0.10.

[Figure 2: Results of the preliminary experiments on Pendulum, Ant, and Humanoid. In each plot, blue represents the MaxEnt agent, and orange represents the random baseline. Panels 2a, 2b, and 2c show the entropy of the policy evolving with the number of epochs. Panels 2d, 2e, and 2f show the log-probability of occupancy of the two-dimensional state space. In 2e and 2f, the infinite xy grid is limited to [−20, 20] × [−20, 20].]

[Figure 3: The number of distinct xy states visited by Ant at various epochs. Results were averaged over N = 20 executions. As the number of policies in the mixture increases, the agent reaches more unique states in the same amount of time.]

5.2. Algorithmic details

Reward functional. Each planning agent was trained to maximize a smooth variant of the KL divergence objective,
$$\mathrm{KL}_\sigma(\mathrm{Unif} \,\|\, d_\pi) = -\mathbb{E}_{s \sim \mathrm{Unif}}\big[\log(d_\pi(s) + \sigma)\big] + C.$$
Such a choice encourages visitation of novel states more aggressively than the entropy objective. The smoothing parameter was chosen to be σ = |S|⁻¹.

Pendulum. The planning oracle is a REINFORCE (Sutton et al., 2000) agent, where the output policy from the previous iteration is used as the initial policy for the next iteration. The policy class is a neural net with a single hidden layer consisting of 128 units. The agent is trained on 200 episodes every epoch. The baseline agent chooses its action randomly at every time step.

Ant. The planning oracle is a Soft Actor-Critic (Haarnoja et al., 2018) agent. The policy class is a neural net with 2 hidden layers composed of 300 units and the ReLU activation function. The agent is trained for 30 episodes, each of which consists of a roll-out of 5000 steps. The mixed policy is executed over 10 trials of T = 10000 steps at the end of each epoch in order to approximate the policy distribution and compute the next reward function. The baseline agent chooses its actions randomly for the same number of trials and steps.

Humanoid. The planning oracle is a Soft Actor-Critic agent. The policy class is a neural net with 2 hidden layers composed of 300 units and the ReLU activation function. The agent is trained for 30 episodes, each of which consists of a roll-out of 5000 steps. The mixed policy is executed for 20 trials of T = 50,000 steps to collect input data that is used to fit the kernel estimation model. This model is then used to estimate the induced state density of the mixed policy.
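As an illustration of the Humanoid pipeline described above (a sketch under the stated assumptions, not the released code), one can fit a scikit-learn KernelDensity model with an Epanechnikov kernel to states visited by the current mixture and score candidate states with the smoothed reward −log(d̂(s) + σ); here the continuous density estimate stands in for the discrete state-visitation frequencies used in the other environments.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kl_sigma_reward(visited_states: np.ndarray, query_states: np.ndarray,
                    sigma: float, bandwidth: float = 0.10) -> np.ndarray:
    """Fit a kernel density estimate (Epanechnikov kernel, as for Humanoid) to the
    states visited by the current mixture, then score candidate states with the
    smoothed KL-style reward -log(d_hat(s) + sigma), which favors poorly-covered
    states. Rows of both arrays are state vectors."""
    kde = KernelDensity(kernel="epanechnikov", bandwidth=bandwidth)
    kde.fit(visited_states)
    d_hat = np.exp(kde.score_samples(query_states))  # estimated density at queries
    return -np.log(d_hat + sigma)
```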
Acknowledgements

Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, the DARPA award FA8650-18-2-7836, and the ONR award N00014-18-1-2247. The authors thank Shie Mannor for helpful discussions.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.

Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 1471–1479, 2016.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, 2005.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018a.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018b.

Chentanez, N., Barto, A. G., and Singh, S. P. Intrinsically motivated reinforcement learning. In Saul, L. K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17, pp. 1281–1288. MIT Press, 2005.

Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818–2826, 2015.

De Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pp. 1466–1475, 2018.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics (NRL), 3(1-2):95–110, 1956.

Fu, J., Co-Reyes, J., and Levine, S. EX2: Exploration with exemplar models for deep reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2577–2587. Curran Associates, Inc., 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018.

Houthooft, R., Chen, X., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. VIME: Variational information maximizing exploration. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1109–1117. Curran Associates, Inc., 2016.

Kaelbling, L. P. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambery, France, 1993. Morgan Kaufmann.

Kakade, S. M. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Lattimore, T. and Hutter, M. Near-optimal PAC bounds for discounted MDPs. Theoretical Computer Science, 558:125–143, 2014.

Lim, S. H. and Auer, P. Autonomous exploration for navigating in MDPs. In Conference on Learning Theory, pp. 40.1–40.24, 2012.

Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 206–214. Curran Associates, Inc., 2012.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mohamed, S. and Jimenez Rezende, D. Variational information maximisation for intrinsically motivated reinforcement learning. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2125–2133. Curran Associates, Inc., 2015.

Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. CoRR, abs/1807.04742, 2018.

Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Ng, A. Y., Harada, D., and Russell, S. J. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 4026–4034. Curran Associates, Inc., 2016.

Osband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 8625–8637. Curran Associates, Inc., 2018.

Ostrovski, G., Bellemare, M. G., van den Oord, A., and Munos, R. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2721–2730, 2017.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Peters, J., Mulling, K., and Altun, Y. Relative entropy policy search. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Ross, S., Gordon, G. J., and Bagnell, J. A. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T. P., and Gelly, S. Episodic curiosity through reachability. CoRR, abs/1810.02274, 2018.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Singh, S., Lewis, R. L., and Barto, A. G. Where do rewards come from? Proceedings of the Annual Conference of the Cognitive Science Society (CogSci), 2009.

Singh, S. P., Lewis, R. L., Barto, A. G., and Sorg, J. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2:70–82, 2010.
Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888. ACM, 2006.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Szita, I. and Szepesvári, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038, 2010.

Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2753–2762. Curran Associates, Inc., 2017.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376, 2007.

Warde-Farley, D., de Wiele, T. V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. CoRR, abs/1811.11359, 2018.

Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.

Zheng, Z., Oh, J., and Singh, S. On learning intrinsic rewards for policy gradient methods. CoRR, abs/1804.06459, 2018.