# Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

Chi Jin (Princeton University) chij@princeton.edu
Sham M. Kakade (University of Washington; Microsoft Research, NYC) sham@cs.washington.edu
Akshay Krishnamurthy (Microsoft Research, NYC) akshaykr@microsoft.com
Qinghua Liu (Princeton University) qinghual@princeton.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge leads to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of POMDPs. In particular, we present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works. OOM-UCB achieves an optimal sample complexity of $O(1/\varepsilon^2)$ for finding an $\varepsilon$-optimal policy, along with being polynomial in all other relevant quantities. As an interesting special case, we also provide a computationally and statistically efficient algorithm for POMDPs with deterministic state transitions.

## 1 Introduction

In many sequential decision making settings, the agent lacks complete information about the underlying state of the system, a phenomenon known as partial observability. Partial observability significantly complicates the tasks of reinforcement learning and planning, because the non-Markovian nature of the observations forces the agent to maintain memory and reason about beliefs of the system state, all while exploring to collect information about the environment. For example, a robot may not be able to perceive all objects in the environment due to occlusions, and it must reason about how these objects may move to avoid collisions [10]. Similar reasoning problems arise in imperfect information games [8], medical diagnosis [13], and elsewhere [25]. Furthermore, from a theoretical perspective, well-known complexity-theoretic results show that learning and planning in partially observable environments is statistically and computationally intractable in general [23, 22, 30, 21].

The standard formulation for reinforcement learning with partial observability is the Partially Observable Markov Decision Process (POMDP), in which an agent operating on noisy observations makes decisions that influence the evolution of a latent state. The complexity barriers apply to this model, but they are of a worst-case nature, and they do not preclude efficient algorithms for interesting subclasses of POMDPs. Thus we ask: Can we develop efficient algorithms for reinforcement learning in large classes of POMDPs?

This question has been studied in recent works [3, 12], which incorporate a decision making component into a long line of work on spectral methods for estimation in latent variable models [14, 29, 2, 1], including the Hidden Markov Model.
Briefly, these estimation results are based on the method of moments, showing that under certain assumptions the model parameters can be computed by a decomposition of a low-degree moment tensor. The works of Azizzadenesheli et al. [3] and Guo et al. [12] use tensor decompositions in the POMDP setting and obtain sample-efficiency guarantees. Neither result considers a setting where strategic exploration is essential for information acquisition, so they do not address one of the central challenges in more general reinforcement learning problems.

**Our contributions.** In this work, we provide new sample-efficient algorithms for reinforcement learning in finite POMDPs in the undercomplete regime, where the number of observations is larger than the number of latent states. This assumption is quite standard in the literature on estimation in latent variable models [1]. Our main algorithm, OOM-UCB, uses the principle of optimism for exploration and uses the information gathered to estimate the observable operators induced by the environment. Our main result proves that OOM-UCB finds a near-optimal policy for the POMDP using a number of samples that scales polynomially with all relevant parameters and additionally with the inverse of the minimum singular value of the emission matrix. Notably, OOM-UCB finds an $\varepsilon$-optimal policy at the optimal rate of $O(1/\varepsilon^2)$.

While OOM-UCB is statistically efficient for this subclass of POMDPs, we should not expect it to be computationally efficient in general, as this would violate computational barriers for POMDPs. However, in our second contribution, we consider a further restricted subclass of POMDPs in which the latent dynamics are deterministic, and we provide both a computationally and statistically efficient algorithm for it. Deterministic dynamics remain an interesting subclass because, while they avoid the computational barriers, they do not mitigate the need for strategic exploration. We prove that our second algorithm has sample complexity scaling with all the relevant parameters as well as with the inverse of the minimum $\ell_2$ distance between emission distributions. This latter quantity replaces the minimum singular value in the guarantee for OOM-UCB and is a more favorable dependency.

We provide further motivation for our assumptions with two lower bounds: the first shows that the overcomplete setting is statistically intractable without additional assumptions, while the second shows that the dependence on the minimum singular value of the emission matrix is necessary. In particular, under our assumptions, the agent must engage in strategic exploration for sample efficiency. As such, the main conceptual advance in our line of inquiry over prior works is that our algorithms address exploration and partial observability in a provably efficient manner.

### 1.1 Related work

A number of computational barriers for POMDPs are known. If the parameters are known, it is PSPACE-complete to compute the optimal policy, and, furthermore, it is NP-hard to compute the optimal memoryless policy [23, 30]. With regards to learning, Mossel and Roch [21] provided an average-case computational hardness result, showing that parameter estimation for a subclass of Hidden Markov Models (HMMs) is at least as hard as learning parity with noise. This directly implies the same hardness result for parameter estimation in POMDP models, since an HMM is simply a POMDP with a fixed action sequence.
On the other hand, for reinforcement learning in POMDPs (in particular, finding a near-optimal policy), one may not need to estimate the model, so this lower bound does not directly imply that the RL problem is computationally intractable. In this work, we do provide lower bounds showing that reinforcement learning in POMDPs is both statistically and computationally intractable in general (Propositions 1 and 2).

On the positive side, there is a long history of work on learning POMDPs. [11] studied POMDPs without resets, where the proposed algorithm has sample complexity scaling exponentially with a certain horizon time, which is not possible to relax without further restrictions. [26, 24] proposed to learn POMDPs using Bayesian methods; PAC or regret bounds are not known for these approaches. [18] studied policy gradient methods for learning POMDPs, but they considered only Markovian policies and did not address exploration.

Closest to our work are POMDP algorithms based on spectral methods [12, 3], which were originally developed for learning latent variable models [14, 2, 1, 29, 28]. These works give PAC and regret bounds (respectively) for tractable subclasses of POMDPs, but, in contrast with our work, they make additional assumptions to mitigate the exploration challenge. In [12], it is assumed that all latent states can be reached with nontrivial probability with a constant number of random actions. This allows for estimating the entire model without sophisticated exploration. [3] consider a special class of memoryless policies in a setting where all of these policies visit every state and take every action with non-trivial probability. As with [12], this restriction guarantees that the entire model can be estimated regardless of the policy executed, so sophisticated exploration is not required. We also mention that [12, 3] assume that both the transition and observation matrices are full rank, which is stronger than our assumptions; we do not make any assumptions on the transition matrix.

Finally, the idea of representing the probability of a sequence as a product of operators dates back to multiplicity automata [27, 9] and reappeared in the Observable Operator Model (OOM) [16] and Predictive State Representations (PSRs) [20]. While spectral methods have been applied to PSRs [7], we are not aware of results with provable guarantees using this approach. It is also worth mentioning that any POMDP can be modeled as an Input-Output OOM [15].

## 2 Preliminaries

In this section, we define the partially observable Markov decision process, the observable operator model [16], and discuss their relationship.

**Notation.** For any natural number $n \in \mathbb{N}$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. We use bold upper-case letters $\mathbf{B}$ to denote matrices and bold lower-case letters $\mathbf{b}$ to denote vectors. $\mathbf{B}_{ij}$ means the $(i, j)$-th entry of matrix $\mathbf{B}$, and $(\mathbf{B})_i$ represents its $i$-th column. For vectors we use $\|\cdot\|_p$ to denote the $\ell_p$-norm, and for matrices we use $\|\cdot\|$, $\|\cdot\|_1$ and $\|\cdot\|_F$ to denote the spectral norm, entrywise $\ell_1$-norm, and Frobenius norm, respectively. We denote by $\|\mathbf{B}\|_{p \to q} = \max_{\|v\|_p \le 1} \|\mathbf{B}v\|_q$ the $p$-to-$q$ norm of $\mathbf{B}$. For any matrix $\mathbf{B} \in \mathbb{R}^{m \times n}$, we use $\sigma_{\min}(\mathbf{B})$ to denote its smallest singular value and $\mathbf{B}^\dagger \in \mathbb{R}^{n \times m}$ to denote its Moore-Penrose inverse. For a vector $v \in \mathbb{R}^n$, we write $\mathrm{diag}(v) \in \mathbb{R}^{n \times n}$ for the diagonal matrix with $[\mathrm{diag}(v)]_{ii} = v_i$ for all $i \in [n]$.
Finally, we use standard big-O and big-Omega notation $O(\cdot)$, $\Omega(\cdot)$ to hide only absolute constants which do not depend on any problem parameters, and the notation $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$ to hide absolute constants and logarithmic factors.

### 2.1 Partially observable Markov decision processes

We consider an episodic tabular Partially Observable Markov Decision Process (POMDP), which can be specified as $\mathrm{POMDP}(H, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathbb{T}, \mathbb{O}, r, \mu_1)$. Here $H$ is the number of steps in each episode; $\mathcal{S}$ is the set of states with $|\mathcal{S}| = S$; $\mathcal{A}$ is the set of actions with $|\mathcal{A}| = A$; $\mathcal{O}$ is the set of observations with $|\mathcal{O}| = O$; $\mathbb{T} = \{\mathbb{T}_h\}_{h=1}^H$ specifies the transition dynamics, such that $\mathbb{T}_h(\cdot \mid s, a)$ is the distribution over states if action $a$ is taken from state $s$ at step $h \in [H]$; $\mathbb{O} = \{\mathbb{O}_h\}_{h=1}^H$ are the emissions, such that $\mathbb{O}_h(\cdot \mid s)$ is the distribution over observations for state $s$ at step $h \in [H]$; $r = \{r_h : \mathcal{O} \to [0, 1]\}_{h=1}^H$ are the known deterministic reward functions¹; and $\mu_1(\cdot)$ is the initial distribution over states. Note that we consider nonstationary dynamics, observations, and rewards.

In a POMDP, states are hidden and unobserved by the learning agent. Instead, the agent only sees the observations and its own actions. At the beginning of each episode, an initial hidden state $s_1$ is sampled from the initial distribution $\mu_1$. At each step $h \in [H]$, the agent first observes $o_h \in \mathcal{O}$, which is generated from the hidden state $s_h \in \mathcal{S}$ according to $\mathbb{O}_h(\cdot \mid s_h)$, and receives the reward $r_h(o_h)$, which can be computed from the observation $o_h$. Then, the agent picks an action $a_h \in \mathcal{A}$, which causes the environment to transition to hidden state $s_{h+1}$, drawn from the distribution $\mathbb{T}_h(\cdot \mid s_h, a_h)$. The episode ends when $o_H$ is observed.

A policy $\pi$ is a collection of $H$ functions $\{\pi_h : \mathcal{T}_h \to \mathcal{A}\}_{h \in [H]}$, where $\mathcal{T}_h = (\mathcal{O} \times \mathcal{A})^{h-1} \times \mathcal{O}$ is the set of all possible histories of length $h$. We use $V^\pi \in \mathbb{R}$ to denote the value of policy $\pi$, so that $V^\pi$ gives the expected cumulative reward received under policy $\pi$:
$$V^\pi := \mathbb{E}_\pi\Big[\textstyle\sum_{h=1}^H r_h(o_h)\Big].$$

¹Since rewards are observable in most applications, it is natural to assume the reward is a known function of the observation. While we study deterministic reward functions for notational simplicity, our results generalize to randomized reward functions. Also, we assume the reward is in $[0, 1]$ without loss of generality.

Since the state, action, and observation spaces, as well as the horizon, are all finite, there always exists an optimal policy $\pi^\star$ which attains the optimal value $V^\star = \sup_\pi V^\pi$. We remark that, in general, the optimal policy of a POMDP selects actions based on the entire history, rather than just the recent observations and actions. This is one of the major differences between POMDPs and standard Markov Decision Processes (MDPs), where the optimal policies are functions of the most recently observed state. This difference makes POMDPs significantly more challenging to solve.

**The POMDP learning objective.** Our objective in this paper is to learn an $\varepsilon$-optimal policy $\hat\pi$, in the sense that $V^{\hat\pi} \ge V^\star - \varepsilon$, using a polynomial number of samples.

### 2.2 The observable operator model

We have described the POMDP model via the transition and observation distributions $\mathbb{T}, \mathbb{O}$ and the initial distribution $\mu_1$. While this parametrization is natural for describing the dynamics of the system, POMDPs can also be fully specified via a different set of parameters: a set of operators $\{B_h(a, o) \in \mathbb{R}^{O \times O}\}_{h \in [H-1],\, a \in \mathcal{A},\, o \in \mathcal{O}}$ and a vector $b_0 \in \mathbb{R}^O$.
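To make the episode dynamics and the value $V^\pi = \mathbb{E}_\pi[\sum_h r_h(o_h)]$ of Section 2.1 concrete, here is a minimal simulation sketch. It is ours, not the paper's, and every quantity in it (sizes, parameter values, the policy) is hypothetical and chosen only for illustration.

```python
import numpy as np

# Minimal tabular POMDP sampler illustrating the episode dynamics of Section 2.1.
# All sizes and parameter values here are hypothetical.
rng = np.random.default_rng(0)
H, S, A, O = 4, 2, 2, 3

mu1 = np.array([0.7, 0.3])                        # initial state distribution mu_1
T = rng.dirichlet(np.ones(S), size=(H, S, A))     # T[h, s, a, :] = T_h(. | s, a)
Obs = rng.dirichlet(np.ones(O), size=(H, S))      # Obs[h, s, :]  = O_h(. | s)
r = rng.random((H, O))                            # known reward r_h(o) in [0, 1]

def sample_episode(policy):
    """Roll out one episode: observe o_h, collect r_h(o_h), then act."""
    s = rng.choice(S, p=mu1)
    history, total = [], 0.0
    for h in range(H):
        o = rng.choice(O, p=Obs[h, s])
        history.append(o)
        total += r[h, o]
        if h < H - 1:                             # the episode ends once o_H is observed
            a = policy(h, tuple(history))         # a policy maps the history to an action
            s = rng.choice(S, p=T[h, s, a])
    return total

uniform = lambda h, hist: rng.integers(A)         # a simple history-independent policy
returns = [sample_episode(uniform) for _ in range(20000)]
print("Monte Carlo estimate of V^pi:", np.mean(returns))
```

The policy here ignores the history only for brevity; as the text emphasizes, near-optimal POMDP policies generally must depend on the full history.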
In the undercomplete setting, where $S \le O$ and where the observation probability matrices $\{\mathbb{O}_h \in \mathbb{R}^{O \times S}\}_{h \in [H]}$ all have full column rank, the operators $\{B_h(a, o)\}_{h,a,o}$ and the vector $b_0$ can be expressed in terms of $(\mathbb{T}, \mathbb{O}, \mu_1)$ as follows:
$$B_h(a, o) = \mathbb{O}_{h+1}\, \mathbb{T}_h(a)\, \mathrm{diag}(\mathbb{O}_h(o \mid \cdot))\, \mathbb{O}_h^\dagger, \qquad b_0 = \mathbb{O}_1 \mu_1, \tag{1}$$
where we use the matrix and vector notation $\mathbb{O}_h \in \mathbb{R}^{O \times S}$ and $\mu_1 \in \mathbb{R}^S$, such that $[\mathbb{O}_h]_{o,s} = \mathbb{O}_h(o \mid s)$ and $[\mu_1]_s = \mu_1(s)$. Here $\mathbb{T}_h(a) \in \mathbb{R}^{S \times S}$ denotes the transition matrix given action $a \in \mathcal{A}$, where $[\mathbb{T}_h(a)]_{s',s} = \mathbb{T}_h(s' \mid s, a)$, and $\mathbb{O}_h(o \mid \cdot) \in \mathbb{R}^S$ denotes the $o$-th row of matrix $\mathbb{O}_h$, with $[\mathbb{O}_h(o \mid \cdot)]_s = \mathbb{O}_h(o \mid s)$. Note that the matrices defined in (1) have rank at most $S$.

Using these matrices $B_h$, it can be shown (Appendix E.1) that, for any sequence $(o_H, \ldots, a_1, o_1) \in \mathcal{O} \times (\mathcal{A} \times \mathcal{O})^{H-1}$, we have:
$$\mathbb{P}(o_H, \ldots, o_1 \mid a_{H-1}, \ldots, a_1) = \mathbf{e}_{o_H}^\top \cdot B_{H-1}(a_{H-1}, o_{H-1}) \cdots B_1(a_1, o_1) \cdot b_0. \tag{2}$$
Describing these conditional probabilities for every sequence is sufficient to fully specify the entire dynamical system. Therefore, as an alternative to directly learning $\mathbb{T}$, $\mathbb{O}$ and $\mu_1$, it is also sufficient to learn the operators $\{B_h(a, o)\}_{h,a,o}$ and the vector $b_0$ in order to learn the optimal policy. The latter approach enjoys the advantage that (2) does not explicitly involve latent variables: it refers only to observable quantities, namely actions and observations.

We remark that the operator model introduced in this section (which is parameterized by $\{B_h(a, o)\}_{h,a,o}$ and $b_0$) bears significant similarity to Jaeger's Input-Output Observable Operator Model (IO-OOM) [16], except for a few minor technical differences.² With some abuse of terminology, we also refer to our model as an Observable Operator Model (OOM) in this paper. It is worth noting that Jaeger's IO-OOMs are strictly more general than POMDPs [16] and also include overcomplete POMDPs via a relation different from (1). Since our focus is on undercomplete POMDPs, we refer the reader to [16] for more details.

²Jaeger's IO-OOM further requires the column-sums of the operators to be 1.
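The following sketch (ours, not from the paper; all parameter values are hypothetical) makes the operator representation concrete: it constructs the operators of equation (1) for a tiny undercomplete model, evaluates a sequence probability via the operator product in equation (2), and cross-checks the result against a direct forward summation over latent states.

```python
import numpy as np

# Sketch of (1)-(2): build B_h(a, o) and b_0 from hypothetical undercomplete POMDP
# parameters (S <= O; the random O_h have full column rank with probability one),
# then check that the operator product reproduces P(o_1..o_H | a_1..a_{H-1}).
rng = np.random.default_rng(1)
H, S, A, O = 4, 2, 2, 3

mu1 = rng.dirichlet(np.ones(S))                                   # mu_1 in R^S
Tmat = rng.dirichlet(np.ones(S), size=(H, A, S))                  # Tmat[h, a, s, s'] = T_h(s'|s, a)
Omat = rng.dirichlet(np.ones(O), size=(H, S)).transpose(0, 2, 1)  # Omat[h, o, s]   = O_h(o|s)

def B(h, a, o):
    """Operator B_h(a, o) = O_{h+1} T_h(a) diag(O_h(o|.)) O_h^dagger, as in (1) (0-indexed h)."""
    Th_a = Tmat[h, a].T                                           # [s', s] = T_h(s'|s, a)
    return Omat[h + 1] @ Th_a @ np.diag(Omat[h, o, :]) @ np.linalg.pinv(Omat[h])

b0 = Omat[0] @ mu1                                                # b_0 = O_1 mu_1

obs = [rng.integers(O) for _ in range(H)]                         # arbitrary o_1..o_H
acts = [rng.integers(A) for _ in range(H - 1)]                    # arbitrary a_1..a_{H-1}

# Equation (2): e_{o_H}^T B_{H-1}(a_{H-1}, o_{H-1}) ... B_1(a_1, o_1) b_0
v = b0
for h in range(H - 1):
    v = B(h, acts[h], obs[h]) @ v
p_oom = v[obs[-1]]

# Direct forward computation over latent states for comparison.
f = mu1 * Omat[0, obs[0], :]                                      # f(s_1) = mu_1(s_1) O_1(o_1|s_1)
for h in range(H - 1):
    f = (Tmat[h, acts[h]].T @ f) * Omat[h + 1, obs[h + 1], :]
p_forward = f.sum()

print(p_oom, p_forward)                                           # the two agree up to round-off
```

The agreement relies on $\mathbb{O}_h^\dagger \mathbb{O}_h = I$, i.e., on the full column rank that the undercomplete setting provides; this is exactly where the construction breaks down for overcomplete models.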
## 3 Main Results

We first state our main assumptions, which we motivate with corresponding hardness results in their absence. We then present our main algorithm, OOM-UCB, along with its sample-efficiency guarantee.

### 3.1 Assumptions

In this paper, we make the following assumptions.

**Assumption 1.** We assume the POMDP is undercomplete, i.e., $S \le O$. We also assume that the minimum singular value of the observation probability matrices satisfies $\sigma_{\min}(\mathbb{O}_h) \ge \alpha > 0$ for all $h \in [H]$.

Both assumptions are standard in the literature on learning Hidden Markov Models (HMMs), an uncontrolled version of POMDPs [see, e.g., 2]. The second assumption, that $\sigma_{\min}(\mathbb{O}_h)$ is bounded below by $\alpha$, is a robust version of the assumption that $\mathbb{O}_h \in \mathbb{R}^{O \times S}$ has full column rank, which is equivalent to $\sigma_{\min}(\mathbb{O}_h) > 0$. Together, these assumptions ensure that the observations contain a reasonable amount of information about the latent states. We do not assume that the initial distribution $\mu_1$ has full support, nor do we assume that the transition probability matrices $\mathbb{T}_h$ are full rank. In fact, Assumption 1 is not sufficient for identification of the system, i.e., for recovering the parameters $\mathbb{T}, \mathbb{O}, \mu_1$ in total variation distance. Exploration is crucial to find a near-optimal policy in our setting.

We motivate both assumptions above by showing that, in the absence of either one, learning a POMDP is statistically intractable; that is, any algorithm would require an exponential number of samples to learn a near-optimal policy with constant probability.

**Proposition 1.** For any algorithm $\mathcal{A}$, there exists an overcomplete POMDP ($S > O$) with $S$ and $O$ being small constants, which satisfies $\sigma_{\min}(\mathbb{O}_h) = 1$ for all $h \in [H]$, such that algorithm $\mathcal{A}$ requires at least $\Omega(A^{H-1})$ samples to ensure learning a $(1/4)$-optimal policy with probability at least $1/2$.

**Proposition 2.** For any algorithm $\mathcal{A}$, there exists an undercomplete POMDP ($S \le O$) with $S$ and $O$ being small constants, such that algorithm $\mathcal{A}$ requires at least $\Omega(A^{H-1})$ samples to ensure learning a $(1/4)$-optimal policy with probability at least $1/2$.

Propositions 1 and 2 are both proved by constructing hard instances, which are modifications of the classical combinatorial locks for MDPs [19]. We refer readers to Appendix B for more details.

### 3.2 Algorithm

We are now ready to describe our algorithm. Assumption 1 enables the representation of the POMDP as an OOM, with the relation specified in Equation (1). Our algorithm, Observable Operator Model with Upper Confidence Bound (OOM-UCB, Algorithm 1), is an optimistic algorithm which heavily exploits the OOM representation to obtain valid uncertainty estimates of the parameters of the underlying model.

**Algorithm 1** Observable Operator Model with Upper Confidence Bound (OOM-UCB)
1: **Initialize:** set all entries of the count vector $\mathbf{n} \in \mathbb{N}^O$ and of the count matrices $\mathbf{N}_h(a, \tilde{a}) \in \mathbb{N}^{O \times O}$, $\mathbf{M}_h(o, a, \tilde{a}) \in \mathbb{N}^{O \times O}$ to zero, for all $(o, a, \tilde{a}) \in \mathcal{O} \times \mathcal{A}^2$.
2: set the confidence set $\Theta_1 \leftarrow \bigcap_{h \in [H]} \{\hat\theta \mid \sigma_{\min}(\hat{\mathbb{O}}_h) \ge \alpha\}$.
3: **for** $k = 1, 2, \ldots, K$ **do**
4: &nbsp;&nbsp;compute the optimistic policy $\pi^k \leftarrow \arg\max_\pi \max_{\hat\theta \in \Theta_k} V^\pi(\hat\theta)$.
5: &nbsp;&nbsp;observe $o_1$, and set $\mathbf{n} \leftarrow \mathbf{n} + \mathbf{e}_{o_1}$.
6: &nbsp;&nbsp;$\mathcal{B}_0 \leftarrow \big(\bigcap_{h \in [H]} \{\hat\theta \mid \sigma_{\min}(\hat{\mathbb{O}}_h) \ge \alpha\}\big) \cap \{\hat\theta \mid \|k\, b_0(\hat\theta) - \mathbf{n}\|_2 \le \beta_k\}$.
7: &nbsp;&nbsp;**for** $(h, a, \tilde{a}) \in [H-1] \times \mathcal{A}^2$ **do**
8: &nbsp;&nbsp;&nbsp;&nbsp;execute policy $\pi^k$ from step $1$ to step $h-2$.
9: &nbsp;&nbsp;&nbsp;&nbsp;take action $\tilde{a}$ at step $h-1$ and action $a$ at step $h$, respectively.
10: &nbsp;&nbsp;&nbsp;&nbsp;observe $(o_{h-1}, o_h, o_{h+1})$, and set $\mathbf{N}_h(a, \tilde{a}) \leftarrow \mathbf{N}_h(a, \tilde{a}) + \mathbf{e}_{o_h} \mathbf{e}_{o_{h-1}}^\top$.
11: &nbsp;&nbsp;&nbsp;&nbsp;set $\mathbf{M}_h(o_h, a, \tilde{a}) \leftarrow \mathbf{M}_h(o_h, a, \tilde{a}) + \mathbf{e}_{o_{h+1}} \mathbf{e}_{o_{h-1}}^\top$.
12: &nbsp;&nbsp;&nbsp;&nbsp;$\mathcal{B}_h(a, \tilde{a}) \leftarrow \bigcap_{o \in \mathcal{O}} \{\hat\theta \mid \|B_h(a, o; \hat\theta)\, \mathbf{N}_h(a, \tilde{a}) - \mathbf{M}_h(o, a, \tilde{a})\|_F \le \gamma_k\}$.
13: &nbsp;&nbsp;construct the confidence set $\Theta_{k+1} \leftarrow \big[\bigcap_{(h, a, \tilde{a}) \in [H-1] \times \mathcal{A}^2} \mathcal{B}_h(a, \tilde{a})\big] \cap \mathcal{B}_0$.
14: **Output:** $\pi^k$ where $k$ is sampled uniformly from $[K]$.

To condense notation in Algorithm 1, we denote the parameters of a POMDP as $\theta = (\mathbb{T}, \mathbb{O}, \mu_1)$. We denote by $V^\pi(\theta)$ the value of policy $\pi$ if the underlying POMDP has parameters $\theta$. We also write the parameters of the OOM, $(b_0(\theta), B_h(a, o; \theta))$, as functions of $\theta$, where the dependency is specified as in (1). We adopt the convention that at the $0$-th step, the observation $o_0$ and state $s_0$ are always set to some fixed dummy observation and state, and, starting from $s_0$, the environment transitions to $s_1$ with distribution $\mu_1$ regardless of what action $a_0$ is taken.

At a high level, Algorithm 1 is an iterative algorithm that, in each iteration, (a) computes an optimistic policy and model by maximizing the value (Line 4) subject to a given confidence set constraint, (b) collects data using the optimistic policy, and (c) incorporates the data into an updated confidence set for the OOM parameters (Lines 5-13). The first two parts are straightforward, so we focus the discussion on computing the confidence set. We remark that in general the optimization in Line 4 may not be solvable in polynomial time (see the discussion of computational complexity after Theorem 3).

First, since $b_0$ in (1) is simply the probability distribution over observations at the first step, our confidence set for $b_0$ in Line 6 is based on counting the number of times each observation appears in the first step, together with Hoeffding's concentration inequality. Our construction of the confidence sets for the operators $\{B_h(a, o)\}_{h,a,o}$ is inspired by the method-of-moments estimator in the HMM literature [14].
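To preview the moment identity that Lines 10-12 exploit (the population matrices are defined precisely in equation (3) just below), here is a self-contained sketch. It is ours, not the paper's estimator, and all parameters are synthetic: it builds the population moments for a tiny undercomplete model, verifies that the true operator maps one moment matrix to the other, and shows the plug-in recovery that the empirical counts $\mathbf{N}/k$, $\mathbf{M}/k$ approximate.

```python
import numpy as np

# Sketch (not the paper's algorithm) of the moment identity behind Line 12:
# fixing actions (a_{h-1}, a_h) and a distribution w over s_{h-1}, the observable
# moment matrices P and Q satisfy  B_h(a_h, o) @ P = Q(o),  so the operator can be
# read off from observable quantities on the directions that P spans.
rng = np.random.default_rng(2)
S, O = 2, 3                                    # undercomplete: S <= O

w      = rng.dirichlet(np.ones(S))             # distribution of s_{h-1}
O_prev = rng.dirichlet(np.ones(O), size=S).T   # [o, s] = O_{h-1}(o | s)
O_cur  = rng.dirichlet(np.ones(O), size=S).T   # [o, s] = O_h(o | s)
O_next = rng.dirichlet(np.ones(O), size=S).T   # [o, s] = O_{h+1}(o | s)
T_prev = rng.dirichlet(np.ones(S), size=S).T   # [s', s] = T_{h-1}(s' | s, a_{h-1})
T_cur  = rng.dirichlet(np.ones(S), size=S).T   # [s', s] = T_h(s' | s, a_h)

# Population moments: P[o', o]      = P(o_h = o', o_{h-1} = o),
#                     Q[o][o', o''] = P(o_{h+1} = o', o_h = o, o_{h-1} = o'').
P = O_cur @ T_prev @ np.diag(w) @ O_prev.T
Q = np.stack([O_next @ T_cur @ np.diag(O_cur[o]) @ T_prev @ np.diag(w) @ O_prev.T
              for o in range(O)])

# True operators from (1) and a check of the identity B(o) @ P = Q(o).
B_true = np.stack([O_next @ T_cur @ np.diag(O_cur[o]) @ np.linalg.pinv(O_cur)
                   for o in range(O)])
assert max(np.abs(B_true[o] @ P - Q[o]).max() for o in range(O)) < 1e-10

# Plug-in recovery from moments (the quantities that N/k and M/k estimate).
# B_hat need not equal B_true everywhere, since P has rank at most S < O, but it
# agrees with B_true on the column space of P -- exactly the reweighted, direction-
# dependent error that the analysis in Section 5 controls.
B_hat = np.stack([Q[o] @ np.linalg.pinv(P) for o in range(O)])
print(max(np.abs(B_hat[o] @ P - Q[o]).max() for o in range(O)))   # ~ 0
```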
Consider two fixed actions $a, \tilde{a}$, and an arbitrary distribution over $s_{h-1}$. Let $\mathbb{P}_h(a, \tilde{a}), \mathbb{Q}_h(o, a, \tilde{a}) \in \mathbb{R}^{O \times O}$ be the probability matrices such that
$$[\mathbb{P}_h(a, \tilde{a})]_{o',o} = \mathbb{P}(o_h = o',\, o_{h-1} = o \mid a_h = a,\, a_{h-1} = \tilde{a}),$$
$$[\mathbb{Q}_h(o, a, \tilde{a})]_{o',o''} = \mathbb{P}(o_{h+1} = o',\, o_h = o,\, o_{h-1} = o'' \mid a_h = a,\, a_{h-1} = \tilde{a}). \tag{3}$$
It can be verified that $B_h(a, o)\, \mathbb{P}_h(a, \tilde{a}) = \mathbb{Q}_h(o, a, \tilde{a})$ (Fact 17 in the appendix). Our confidence set construction (Line 12 in Algorithm 1) is based on this fact: we replace the probability matrices $\mathbb{P}, \mathbb{Q}$ by the empirical estimates $\mathbf{N}, \mathbf{M}$, and we use concentration inequalities to determine the width of the confidence set. Finally, our overall confidence set for the parameters $\theta$ is simply the intersection of the confidence sets for all induced operators and $b_0$, additionally incorporating the constraint on $\sigma_{\min}(\mathbb{O}_h)$ from Assumption 1.

### 3.3 Theoretical guarantees

Our OOM-UCB algorithm enjoys the following sample complexity guarantee.

**Theorem 3.** For any $\varepsilon \in (0, H]$, there exist $K_{\max} = \mathrm{poly}(H, S, A, O, \alpha^{-1})/\varepsilon^2$ and an absolute constant $c_1$ such that, for any POMDP that satisfies Assumption 1, if we set the hyperparameters $\beta_k = c_1\sqrt{k \log(KAOH)}$, $\gamma_k = S\beta_k/\alpha$, and $K \ge K_{\max}$, then the output policy $\hat\pi$ of Algorithm 1 is $\varepsilon$-optimal with probability at least $2/3$.

Theorem 3 claims that in polynomially many iterations of the outer loop, Algorithm 1 learns a near-optimal policy for any undercomplete POMDP that satisfies Assumption 1. Since our algorithm only uses $O(H^2 A^2)$ samples per iteration of the outer loop, this implies that the sample complexity is also $\mathrm{poly}(H, S, A, O, \alpha^{-1})/\varepsilon^2$. We remark that the $1/\varepsilon^2$ dependence is optimal, which follows from standard concentration arguments. To the best of our knowledge, this is the first sample-efficiency result for learning a class of POMDPs where exploration is essential.³

³See Appendix C.4 for the explicit polynomial dependence of the sample complexity; here, the success probability is a constant, but one can make it arbitrarily close to 1 by a standard boosting trick (see Appendix E.3).

While Theorem 3 does guarantee sample efficiency, Algorithm 1 is not computationally efficient, because the computation of the optimistic policy (Line 4) may not admit a polynomial-time implementation; this should be expected given the aforementioned computational complexity results. We now turn to a further restricted (and interesting) subclass of POMDPs where we can address both the computational and statistical challenges.

## 4 Results for POMDPs with Deterministic Transitions

In this section, we complement our main result by investigating the class of POMDPs with deterministic transitions, where both computational and statistical efficiency can be achieved. We say a POMDP has deterministic transitions if both its transitions and its initial distribution are deterministic, i.e., if the entries of the matrices $\{\mathbb{T}_h\}_h$ and the vector $\mu_1$ are either $0$ or $1$. We remark that while deterministic dynamics avoid the computational barriers, they do not mitigate the need for exploration.

Instead of Assumption 1, for the deterministic-transition case we require that the columns of the observation matrices $\mathbb{O}_h$ are well separated.

**Assumption 2.** For any $h \in [H]$, $\min_{s \neq s'} \|\mathbb{O}_h(\cdot \mid s) - \mathbb{O}_h(\cdot \mid s')\|_2 \ge \xi$.

Assumption 2 guarantees that the observation distributions for different states are sufficiently different, by at least $\xi$ in Euclidean norm. It does not require that the POMDP is undercomplete and is, in fact, strictly weaker than Assumption 1.
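For quantitative intuition about how the two separation notions compare (the general inequality is stated in the next sentence), here is a small numerical check on a random, hypothetical emission matrix; it is an illustration of ours, not part of the paper's analysis.

```python
import numpy as np

# Compare xi = min_{s != s'} ||O_h(.|s) - O_h(.|s')||_2 with sqrt(2) * sigma_min(O_h)
# for a random, hypothetical undercomplete emission matrix O_h in R^{O x S}.
rng = np.random.default_rng(3)
S, O = 4, 10
Oh = rng.dirichlet(np.ones(O), size=S).T          # columns are the distributions O_h(.|s)

xi = min(np.linalg.norm(Oh[:, s] - Oh[:, t])
         for s in range(S) for t in range(S) if s != t)
sigma_min = np.linalg.svd(Oh, compute_uv=False)[-1]

print(f"xi = {xi:.4f},  sqrt(2)*sigma_min = {np.sqrt(2) * sigma_min:.4f}")
# Since ||O_h e_s - O_h e_{s'}||_2 >= sigma_min(O_h) * ||e_s - e_{s'}||_2
# and ||e_s - e_{s'}||_2 = sqrt(2), the first printed quantity is never smaller
# than the second, matching the inequality stated in the text.
```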
In particular, for undercomplete models, $\min_{s \neq s'} \|\mathbb{O}_h(\cdot \mid s) - \mathbb{O}_h(\cdot \mid s')\|_2 \ge \sqrt{2}\, \sigma_{\min}(\mathbb{O}_h)$, and so Assumption 1 implies Assumption 2 with $\xi = \sqrt{2}\alpha$.

Leveraging deterministic transitions, we can design a specialized algorithm (Algorithm 2 in the appendix) that learns an $\varepsilon$-optimal policy using polynomially many samples and in polynomial time. We present the formal theorem here and refer readers to Appendix D for more details.

**Theorem 4.** For any $p \in (0, 1]$, there exists an algorithm such that, for any deterministic-transition POMDP satisfying Assumption 2, within $O\big(H^2 S A \log(HSA/p) / \min\{\varepsilon/(\sqrt{O}H),\, \xi\}^2\big)$ samples and computations, the output policy of the algorithm is $\varepsilon$-optimal with probability at least $1 - p$.

## 5 Analysis Overview

In this section, we provide an overview of the proof of our main result, Theorem 3. Please refer to Appendix C for the full proof. We start our analysis by noticing that the output policy $\hat\pi$ of Algorithm 1 is uniformly sampled from the policies $\{\pi^k\}_{k=1}^K$ computed by the algorithm. If we can show that
$$(1/K) \textstyle\sum_{k=1}^K \big(V^\star - V^{\pi^k}\big) \le \varepsilon/10, \tag{4}$$
then at least a $2/3$ fraction of the policies in $\{\pi^k\}_{k=1}^K$ must be $\varepsilon$-optimal, and uniform sampling finds such a policy with probability at least $2/3$. Therefore, our proof focuses on establishing (4). We begin by conditioning on the event that, for each iteration $k$, our constructed confidence set $\Theta_k$ in fact contains the true parameters $\theta^\star = (\mathbb{T}, \mathbb{O}, \mu_1)$ of the POMDP. This holds with high probability and is achieved by setting the widths $\beta_k$ and $\gamma_k$ appropriately (see Lemma 14 in the appendix).

### 5.1 Bounding suboptimality in value by error in density estimation

Line 4 of Algorithm 1 computes the greedy policy $\pi^k \leftarrow \arg\max_\pi \max_{\hat\theta \in \Theta_k} V^\pi(\hat\theta)$ with respect to the current confidence set $\Theta_k$. Let $\theta^k$ denote the maximizing model parameters in the $k$-th iteration. As $(\pi^k, \theta^k)$ are optimistic, we have $V^\star = V^{\pi^\star}(\theta^\star) \le V^{\pi^k}(\theta^k)$ for all $k \in [K]$. Thus, for any $k \in [K]$:
$$V^\star - V^{\pi^k} \le V^{\pi^k}(\theta^k) - V^{\pi^k}(\theta^\star) \le H \sum_{o_H, \ldots, o_1} \big| \mathbb{P}^{\pi^k}_{\theta^k}(o_H, \ldots, o_1) - \mathbb{P}^{\pi^k}_{\theta^\star}(o_H, \ldots, o_1) \big|, \tag{5}$$
where $\mathbb{P}^\pi_\theta$ denotes the probability measure over observations under policy $\pi$ for the POMDP with parameters $\theta$. The second inequality holds because the cumulative reward is a function of the observations $(o_H, \ldots, o_1)$ and is upper bounded by $H$. This upper bounds the suboptimality in value by the total variation distance between the $H$-step observation distributions.

Next, note that we can always choose the greedy policy $\pi^k$ to be deterministic, i.e., the probability of taking any action given a history is either $0$ or $1$. This allows us to define the following set for any deterministic policy $\pi$:
$$\Gamma(\pi, H) := \{\tau_H = (o_H, \ldots, a_1, o_1) \mid \pi(a_{H-1}, \ldots, a_1 \mid o_H, \ldots, o_1) = 1\}.$$
In words, $\Gamma(\pi, H)$ is the set of all observation and action sequences of length $H$ that could occur under $\pi$. For any deterministic policy $\pi$, there is a one-to-one correspondence between $\mathcal{O}^H$ and $\Gamma(\pi, H)$, and moreover, for any sequence $\tau_H = (o_H, \ldots, a_1, o_1) \in \Gamma(\pi, H)$, we have:
$$p(\tau_H; \theta) := \mathbb{P}_\theta(o_H, \ldots, o_1 \mid a_{H-1}, \ldots, a_1) = \mathbb{P}^\pi_\theta(o_H, \ldots, o_1). \tag{6}$$
The derivation of equation (6) can be found in Appendix E.2. Combining this with (5) and summing over all episodes, we conclude that:
$$\sum_{k=1}^K \big(V^\star - V^{\pi^k}\big) \le H \sum_{k=1}^K \sum_{\tau_H \in \Gamma(\pi^k, H)} \big| p(\tau_H; \theta^k) - p(\tau_H; \theta^\star) \big|.$$
This upper bounds the suboptimality in value by errors in estimating the conditional probabilities.

### 5.2 Bounding error in density estimation by error in estimating operators

For the next step, we leverage the OOM representation to bound the difference between the conditional probabilities $p(\tau_H; \theta^k)$ and $p(\tau_H; \theta^\star)$.
Recall from (2) that the conditional probability can be written as a product of the observable operators at each step applied to $b_0$. Therefore, for any two parameters $\hat\theta$ and $\theta$, we have the following relation for any sequence $\tau_H = (o_H, \ldots, a_1, o_1)$:
$$p(\tau_H; \hat\theta) - p(\tau_H; \theta) = \mathbf{e}_{o_H}^\top B_{H-1}(a_{H-1}, o_{H-1}; \hat\theta) \cdots B_1(a_1, o_1; \hat\theta)\, \big[b_0(\hat\theta) - b_0(\theta)\big]$$
$$+ \sum_{h=1}^{H-1} \mathbf{e}_{o_H}^\top B_{H-1}(a_{H-1}, o_{H-1}; \hat\theta) \cdots B_{h+1}(a_{h+1}, o_{h+1}; \hat\theta)\, \big[B_h(a_h, o_h; \hat\theta) - B_h(a_h, o_h; \theta)\big]\, B_{h-1}(a_{h-1}, o_{h-1}; \theta) \cdots B_1(a_1, o_1; \theta)\, b_0(\theta).$$
This relates the difference $p(\tau_H; \hat\theta) - p(\tau_H; \theta)$ to the differences in the operators and in $b_0$. Formally, with further relaxation and summation over all sequences in $\Gamma(\pi, H)$, we have the following lemma (also see Lemma 10 in Appendix C).

**Lemma 5.** Given a deterministic policy $\pi$ and two sets of undercomplete POMDP parameters $\theta = (\mathbb{O}, \mathbb{T}, \mu_1)$ and $\hat\theta = (\hat{\mathbb{O}}, \hat{\mathbb{T}}, \hat\mu_1)$ with $\sigma_{\min}(\hat{\mathbb{O}}) \ge \alpha$, we have
$$\sum_{\tau_H \in \Gamma(\pi, H)} |p(\tau_H; \hat\theta) - p(\tau_H; \theta)| \;\lesssim\; \big\| b_0(\hat\theta) - b_0(\theta) \big\|_1 + \sum_{(a, o) \in \mathcal{A} \times \mathcal{O}} \big\| [B_1(a, o; \hat\theta) - B_1(a, o; \theta)]\, b_0(\theta) \big\|_1$$
$$+ \sum_{h=2}^{H-1} \sum_{(a, \tilde{a}, o) \in \mathcal{A}^2 \times \mathcal{O}} \sum_{s \in \mathcal{S}} \big\| [B_h(a, o; \hat\theta) - B_h(a, o; \theta)]\, \mathbb{O}_h \mathbb{T}_{h-1}(\tilde{a})\, \mathbf{e}_s \big\|_1 \cdot \mathbb{P}^\pi_\theta(s_{h-1} = s), \tag{7}$$
where $\lesssim$ hides factors polynomial in the problem parameters (see Lemma 10 in Appendix C for the precise statement).

This lemma suggests that if we could estimate the operators accurately, we would have small value suboptimality. However, Assumption 1 is not sufficient for parameter recovery: it is possible that at some step $h$ there exists a state $s_h$ that can be reached only with very small probability, no matter what policy is used. Since we cannot collect many samples from $s_h$, it is not possible to estimate the corresponding component of the operator $B_h$. In other words, we cannot hope to make $\|B_h(a, o; \hat\theta) - B_h(a, o; \theta)\|_1$ small in our setting.

To proceed, it is crucial to observe that the third term on the RHS of (7) is in fact the operator error $B_h(a, o; \hat\theta) - B_h(a, o; \theta)$ projected onto the direction $\mathbb{O}_h \mathbb{T}_{h-1}(\tilde{a})\, \mathbf{e}_s$ and additionally reweighted by the probability of visiting state $s$ at step $h-1$. Therefore, if $s$ is hard to reach, the weighting probability will be very small, which means that even though we cannot estimate $B_h(a, o; \theta)$ accurately in the corresponding direction, that direction has a negligible contribution to the density estimation error (the LHS of (7)).

### 5.3 Bounding error in estimating operators by the OOM-UCB algorithm

By Lemma 5, we only need to bound the error in the operators reweighted by the visitation probability. This is achieved by a careful design of the confidence sets in the OOM-UCB algorithm. The construction is based on the method of moments, which heavily exploits the undercompleteness of the POMDP. To showcase the main idea, we focus on bounding the third term on the RHS of (7).

Consider a fixed $(o, a, \tilde{a})$ tuple, a fixed step $h \in [H]$, and a fixed iteration $k \in [K]$. We define the moment matrices $\mathbb{P}_h(a, \tilde{a}), \mathbb{Q}_h(o, a, \tilde{a}) \in \mathbb{R}^{O \times O}$ as in (3) for the distribution over $s_{h-1}$ that equals $(1/k) \sum_{t=1}^k \mathbb{P}^{\pi^t}_{\theta^\star}(s_{h-1} = \cdot)$. We also denote $\hat{\mathbb{P}}_h(a, \tilde{a}) = \mathbf{N}_h(a, \tilde{a})/k$ and $\hat{\mathbb{Q}}_h(o, a, \tilde{a}) = \mathbf{M}_h(o, a, \tilde{a})/k$, for the matrices $\mathbf{N}_h, \mathbf{M}_h$ after the update in the $k$-th iteration of Algorithm 1. By martingale concentration, it is not hard to show that with high probability:
$$\big\| \mathbb{P}_h(a, \tilde{a}) - \hat{\mathbb{P}}_h(a, \tilde{a}) \big\|_F \le O(1/\sqrt{k}), \qquad \big\| \mathbb{Q}_h(o, a, \tilde{a}) - \hat{\mathbb{Q}}_h(o, a, \tilde{a}) \big\|_F \le O(1/\sqrt{k}).$$
Additionally, we can show that for the true operator and the true moments, we have $B_h(a, o; \theta^\star)\, \mathbb{P}_h(a, \tilde{a}) = \mathbb{Q}_h(o, a, \tilde{a})$. Meanwhile, by the construction of our confidence set $\Theta_{k+1}$, we know that for any $\hat\theta \in \Theta_{k+1}$,
$$\big\| B_h(a, o; \hat\theta)\, \hat{\mathbb{P}}_h(a, \tilde{a}) - \hat{\mathbb{Q}}_h(o, a, \tilde{a}) \big\|_F \le \gamma_k / k.$$
Combining all of the relations above, we see that $B_h(a, o; \hat\theta)$ is accurate in the directions spanned by $\mathbb{P}_h(a, \tilde{a})$, which, by definition, are the directions frequently visited by the previous policies $\{\pi^t\}_{t=1}^k$. Formally, we have the following lemma, which allows us to bound the third term on the RHS of (7) using the algebraic transformation in Lemma 16.

**Lemma 6.**
With probability at least $1 - \delta$, for all $k \in [K]$, for any $\hat\theta = (\hat{\mathbb{O}}, \hat{\mathbb{T}}, \hat\mu_1) \in \Theta_{k+1}$ and $(o, a, \tilde{a}, h) \in \mathcal{O} \times \mathcal{A}^2 \times \{2, \ldots, H-1\}$, and with $\iota = \log(KAOH/\delta)$, we have
$$\sum_{s \in \mathcal{S}} \big\| [B_h(a, o; \hat\theta) - B_h(a, o; \theta^\star)]\, \mathbb{O}_h \mathbb{T}_{h-1}(\tilde{a})\, \mathbf{e}_s \big\|_1 \sum_{t=1}^{k} \mathbb{P}^{\pi^t}_{\theta^\star}(s_{h-1} = s) \;\le\; \widetilde{O}(\cdots).$$

## 6 Conclusion

In this paper, we give a sample-efficient algorithm for reinforcement learning in undercomplete POMDPs. Our results leverage a connection to the observable operator model and employ a refined error analysis. To our knowledge, this gives the first provably efficient algorithm for strategic exploration in partially observable environments.

## Broader Impact

As this is a theoretical contribution, we do not envision that our direct results will have a tangible societal impact. Our broader line of inquiry could impact a line of thinking in a way that provides additional means to obtain confidence intervals relevant for planning and learning. There is an increasing need for applications to understand planning under uncertainty in the broader context of safety and reliability, and POMDPs provide one potential framework.

## Funding Disclosure

This work was supported by Microsoft and Princeton University. S.K. gratefully acknowledges funding from the ONR award N00014-18-1-2247 and NSF Awards CCF-1703574 and CCF-1740551.

## References

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773-2832, 2014.

[2] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, pages 33.1-33.34, 2012.

[3] K. Azizzadenesheli, A. Lazaric, and A. Anandkumar. Reinforcement learning of POMDPs using spectral methods. In 29th Annual Conference on Learning Theory, 2016.

[4] S. Bazinin and G. Shani. Iterative planning for deterministic QDec-POMDPs. In GCAI, pages 15-28, 2018.

[5] C. Besse and B. Chaib-Draa. Quasi-deterministic partially observable Markov decision processes. In International Conference on Neural Information Processing, pages 237-246. Springer, 2009.

[6] B. Bonet. Deterministic POMDPs revisited. arXiv preprint arXiv:1205.2659, 2012.

[7] B. Boots, S. M. Siddiqi, and G. J. Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954-966, 2011.

[8] N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418-424, 2018.

[9] J. W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26-40, 1971.

[10] A. R. Cassandra, L. P. Kaelbling, and J. A. Kurien. Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '96), volume 2, pages 963-972. IEEE, 1996.

[11] E. Even-Dar, S. M. Kakade, and Y. Mansour. Reinforcement learning in POMDPs without resets. 2005.

[12] Z. D. Guo, S. Doroudi, and E. Brunskill. A PAC RL algorithm for episodic POMDPs. In Artificial Intelligence and Statistics, pages 510-518, 2016.

[13] M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18(3):221-244, 2000.

[14] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460-1480, 2012.

[15] H. Jaeger.
Discrete-time, discrete-valued observable operator models: a tutorial. GMD Forschungszentrum Informationstechnik, 1998.

[16] H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398, 2000.

[17] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.

[18] A. Kamyar, Y. Yue, and A. Anandkumar. Policy gradient in partially observable environments: Approximation and convergence. arXiv preprint arXiv:1810.07900, 2018.

[19] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840-1848, 2016.

[20] M. L. Littman and R. S. Sutton. Predictive representations of state. In Advances in Neural Information Processing Systems, pages 1555-1561, 2002.

[21] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 366-375, 2005.

[22] M. Mundhenk, J. Goldsmith, C. Lusena, and E. Allender. Complexity of finite-horizon Markov decision process problems. Journal of the ACM (JACM), 47(4):681-720, 2000.

[23] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441-450, 1987.

[24] P. Poupart and N. Vlassis. Model-based Bayesian reinforcement learning in partially observable domains. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics, pages 1-2, 2008.

[25] A. N. Rafferty, E. Brunskill, T. L. Griffiths, and P. Shafto. Faster teaching by POMDP planning. In International Conference on Artificial Intelligence in Education, pages 280-287. Springer, 2011.

[26] S. Ross, B. Chaib-draa, and J. Pineau. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems, pages 1225-1232, 2008.

[27] M. P. Schützenberger. On the definition of a family of automata. Information and Control, 4(2-3):245-270, 1961.

[28] V. Sharan, S. M. Kakade, P. S. Liang, and G. Valiant. Learning overcomplete HMMs. In Advances in Neural Information Processing Systems, pages 940-949, 2017.

[29] L. Song, B. Boots, S. Siddiqi, G. J. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. 2010.

[30] N. Vlassis, M. L. Littman, and D. Barber. On the computational complexity of stochastic controller optimization in POMDPs. ACM Transactions on Computation Theory (TOCT), 4(4):1-8, 2012.