# Reinforcement Learning in Reward-Mixing MDPs

Jeongyeol Kwon, The University of Texas at Austin, kwonchungli@utexas.edu
Yonathan Efroni, Microsoft Research, NYC, jonathan.efroni@gmail.com
Constantine Caramanis, The University of Texas at Austin, constantine@utexas.edu
Shie Mannor, Technion, NVIDIA, shie@ee.technion.ac.il, smannor@nvidia.com

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Learning a near optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP). There, a reward function is drawn from one of multiple possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near optimal policy for two reward-mixing MDPs. Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality. Indeed, with no further assumptions, even for two switching reward models, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We provide the first polynomial-time algorithm that finds an $\epsilon$-optimal policy after exploring $O(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the numbers of states and actions respectively. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space.

1 Introduction

In reinforcement learning (RL), an agent solves a sequential decision-making problem in an unknown dynamic environment to maximize the long-term reward [46]. The agent interacts with the environment by receiving feedback on its actions in the form of a state-dependent reward and observation. One of the key challenges in RL is exploration: in the absence of auto-exploratory properties such as ergodicity of the system, the agent has to devise a clever scheme to collect data from under-explored parts of the system. In a long line of work, efficient exploration in RL has been extensively studied for fully observable environments, i.e., under the framework of Markov decision processes (MDPs), with a number of polynomial-time algorithms proposed [34, 27, 43, 40, 18]. For instance, in tabular MDPs, i.e., environments with a finite number of states and actions, we can achieve optimal sample complexity or minimax regret without any assumptions on the system dynamics [3, 50].

In contrast to fully observable environments, very little is known about exploration in partially observable MDPs (POMDPs). In general, RL in POMDPs may require an exponential number of samples (without simplifying structural assumptions) [35, 31]. Therefore, it is important to consider natural sub-classes of POMDPs which admit tractable solutions. Previous studies focus on a special class of POMDPs where observation spaces are large enough such that a single observation provides sufficient information on the underlying state [22, 35, 4, 12, 16, 30]. However, there are many applications in robotics, topic modeling and beyond, where the observation space is smaller than the latent state space.
This small observation space setting is critical, as most prior work does not apply. Our work seeks to provide one of the first results to tackle a class of problems in this space. In particular, we tackle a sub-class of POMDPs inspired by latent variable or mixture problems. Specifically, we consider reward-mixing MDPs (RM-MDPs). In RM-MDPs, the transition kernel and initial state distribution are defined as in standard MDPs, while the reward function is randomly chosen from one of M reward models at the beginning of every episode (see the formal Definition 1). RM-MDPs are important in their own right. Consider, for instance, a user-interaction model in a dynamical web system, where reward models vary across users with different characteristics [38, 23]. A similar setting has been considered in a number of recent papers [8, 6, 23, 45, 7], and most recently, and most directly related to our setting, [36].

Even for M = 2, the RM-MDP setting shares some of the key challenges of general POMDPs. This is because the best policy for the RM-MDP problem should account for not just the current state, but also an entire sequence of previous observations, since some previous events might be correlated with the reward of the current action. And in the small observation space setting, prior work as in [22, 35, 4, 12, 16, 30] cannot be applied. The work in [36] develops an algorithmic framework for solving the RM-MDP problem in this small observation setting, but requires strong assumptions without which the algorithm falls apart. We discuss this in more detail below. Indeed, no work to date has been able to address the RM-MDP problem, even for M = 2, without significant further assumptions. This is precisely the problem we tackle in this paper.

Main Results and Contributions

We focus on the problem of learning near-optimal policies in two reward-mixing MDPs, i.e., RM-MDPs with two reward models (M = 2). To the best of our knowledge, no prior work has studied the sample complexity of RL in RM-MDPs, even for M = 2, without any assumptions on the system dynamics. Specifically, we provide the first polynomial-time algorithm which learns an $\epsilon$-optimal policy after exploring $O(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes. We also show that $\Omega(S^2 A^2 / \epsilon^2)$ episodes are necessary, and thus our result is tight in $S, A$. This is the first efficient algorithm in a sub-class of partially observable domains with small observation spaces and without any assumptions on system dynamics.

On the technical side, we must overcome several new challenges that are not present in fully observable settings. One key hurdle relates to identifiability. In standard MDPs, an average of single observations from a single state-action pair is sufficient to get unbiased estimators of the transition and reward models for that state-action. In RM-MDPs, however, the same quantity would only give an averaged reward model, where the average is taken over the reward model distribution. An average reward model alone is not sufficient to obtain the optimal policy, which depends on a sequence of previous rewards. Our technique appeals to the idea of uncertainty in higher-order moments, bringing inspiration from algorithms for learning a mixture of structured distributions [19, 9, 26, 15]. In such problems, the key to efficient learning is to leverage information from higher-order moments. Following this spirit, we consider estimating correlations of rewards at multiple different state-actions.
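As a small numerical illustration of this point (ours, not from the paper): two reward models that are symmetric around the same mean are invisible to first moments, but are separated by within-episode reward correlations. The values below are arbitrary toy parameters.

```python
import numpy as np

# Two state-action pairs, two latent reward models, uniform mixing weights.
p1 = np.array([0.9, 0.9])   # P(r=1 | x, context 1)
p2 = np.array([0.1, 0.1])   # P(r=1 | x, context 2)

p_plus = 0.5 * (p1 + p2)    # averaged reward model: [0.5, 0.5]

# First moments match a plain MDP with reward probabilities p_plus,
# but the within-episode reward correlation does not.
mixture_corr = 0.5 * (p1[0] * p1[1] + p2[0] * p2[1])  # E[r_1 r_2] under the mixture
averaged_corr = p_plus[0] * p_plus[1]                 # E[r_1 r_2] if rewards were independent Bernoulli(p_plus)

print(mixture_corr, averaged_corr)   # 0.41 vs 0.25: second moments separate the two situations
```

Estimating such correlations requires observing both state-actions within the same episode, which is what drives the exploration scheme developed in Section 3.1.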
A central challenge we must overcome is that in finite-horizon MDPs, it may not be possible to estimate all higher-order correlations with uniformly good accuracy. In fact, it may not be possible to estimate some correlations at all, when some pairs of state-actions are not reachable in the same episode. Therefore, we cannot simply rely on existing techniques that require good estimates of all elements in higher-order moment matrices. The main technical challenge is, therefore, to show that a near-optimal policy can still be obtained from uncertain and partial correlations of state-actions. Our technical contributions are thus two-fold: (1) the design of an efficient exploration algorithm to estimate each correlation up to some required accuracy, and (2) new technical tools to analyze errors from different levels of uncertainty in estimated higher-order correlations.

Related Work

Efficient exploration in fully observed environments has been extensively studied in the framework of MDPs. In tabular MDPs, a long line of work has studied efficient exploration algorithms and their sample complexity based on the principle of optimism [27, 43, 3, 18, 50, 47, 44] or posterior sampling [1, 43, 40, 41]. Beyond tabular settings, recent work seeks efficient exploration in more challenging settings such as with large or continuous state spaces [28, 32, 17]. All aforementioned work considers only fully observable environments, where the complexity of exploration is significantly lower as there is no need to consider observation histories.

Most previous studies on exploration in POMDPs focus on large observation spaces such that a set of single observations is statistically sufficient to infer internal states, i.e., the observation matrix has non-zero minimum singular value. In such environments, previous work either considers a strong assumption on the uniform ergodicity of the latent state transitions [25, 5, 22, 4] or the uniqueness of underlying states for every observation [35, 12, 16]. The exception is a recent work by Jin et al. [30] which studied efficient exploration without any assumptions on latent system dynamics. One of the problems most closely related to our work is reinforcement learning with latent contexts (also referred to as multitask RL), considered in [6, 23, 36]. Previous approaches relied on several strong assumptions such as a very long time horizon [6, 23], contexts revealed in hindsight, or ergodic properties of the system dynamics [36]. Our work takes a first step without such assumptions by focusing on environments where only the reward model differs across different tasks.

RM-MDPs might also be viewed as a special case of adversarial MDPs (e.g., [49, 39, 29]) with an oblivious adversary who can only play within a finite set of reward models. However, our goal is to find an optimal policy in the broader class of history-dependent policies, whereas the adversarial MDP literature compares only to the best Markovian policy in hindsight. RM-MDPs can also be considered an instance of latent variable models (e.g., [14, 20, 13, 25, 24]), where sampled reward sequences are influenced by unobserved confounders. The RM-MDP poses several new challenges, as the challenges of reinforcement learning and statistical inference are interlaced by latent contexts.
2 Problem Setup

We study episodic reinforcement learning with time horizon $H$ in reward-mixing Markov decision processes (RM-MDPs), defined as follows:

Definition 1 (Reward-Mixing Markov Decision Process (RM-MDP)) Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, \nu, \{w_m\}_{m=1}^{M}, \{R_m\}_{m=1}^{M})$ be an RM-MDP on a state space $\mathcal{S}$ and action space $\mathcal{A}$, where $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is a common transition probability measure that maps a state-action pair and a next state to a probability, and $\nu$ is a common initial state distribution. Let $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$. $w_1, \ldots, w_M$ are the mixing weights: at the beginning of every episode, one reward model $R_m$ is randomly chosen with probability $w_m$. A reward model $R_m : \mathcal{S} \times \mathcal{A} \times \{0,1\} \to [0,1]$ is a probability measure for rewards that maps a state-action pair and a binary reward to a probability, for $m \in [M]$.

In this work, we focus on the special case of RM-MDPs with $M = 2$, i.e., 2RM-MDPs. For ease of presentation, we assume all random rewards take values in $\{0,1\}$, i.e., rewards are binary random variables. We also assume a uniform prior over the latent contexts, i.e., $w_1 = w$ and $w_2 = 1 - w$ with $w = 1/2$. Detailed discussions on these simplifying assumptions can be found in Section 5.

We consider a policy class $\Pi$ which contains all history-dependent policies $\pi : (\mathcal{S} \times \mathcal{A} \times \{0,1\})^{*} \times \mathcal{S} \to \mathcal{A}$. The goal of the problem is to find a near optimal policy $\pi \in \Pi$ whose value is near optimal with respect to the optimal one,
$$V^{*}_{\mathcal{M}} := \max_{\pi \in \Pi} \sum_{m=1}^{M} w_m \, \mathbb{E}^{\pi}_{m}\Big[ \sum_{t=1}^{H} r_t \Big],$$
where $\mathbb{E}^{\pi}_{m}[\cdot]$ is the expectation taken over the $m^{th}$ MDP with policy $\pi$. More formally, we wish our algorithm to return an $(\epsilon, \eta)$ probably-approximately-correct (PAC) optimal policy, which we also refer to as a near optimal policy, defined as follows:

Definition 2 ($(\epsilon, \eta)$-PAC optimal policy) An algorithm is $(\epsilon, \eta)$-PAC if it returns a policy $\hat{\pi}$ such that $\mathbb{P}(V^{*}_{\mathcal{M}} - V^{\hat{\pi}}_{\mathcal{M}} \le \epsilon) \ge 1 - \eta$.

2.1 Notation

We often denote a state-action pair $(s, a)$ by a single symbol $x = (s, a) \in \mathcal{S} \times \mathcal{A}$. For any sequence $(y_1, y_2, \ldots, y_t)$ of length $t$, we write $(y)_{1:t}$ for any symbol $y$. We denote the $\ell_1$ sum of any function $f$ over a random variable $X$ conditioned on an event $E$ as $\|f(X|E)\|_1 = \sum_{X \in \mathcal{X}} |f(X|E)|$, where $\mathcal{X}$ is the support of $X$. $\mathbb{1}\{E\}$ is an indicator function for any event $E$, and $\mathbb{P}(E)$ is the probability of any event $E$ measured without contexts. If we use $\mathbb{P}$ without any subscript, it is the probability of an event measured outside of the context, i.e., $\mathbb{P}(\cdot) = \sum_{m=1}^{M} w_m \mathbb{P}_m(\cdot)$. We refer to $\mathbb{P}_m$ as the probability of any event measured in the $m^{th}$ context (or in the $m^{th}$ MDP). If the probability of an event depends on a policy $\pi$, we add a superscript $\pi$ to $\mathbb{P}$. We denote by $V^{\pi}_{\mathcal{M}}$ the expected long-term reward for model $\mathcal{M}$ with policy $\pi$. We use $\hat{\cdot}$ to denote empirical counterparts. For any set $\mathcal{A}$ and $d \in \mathbb{N}$, $\mathcal{A}^{\otimes d}$ is the $d$-ary Cartesian product over $\mathcal{A}$. We use $a \vee b$ for $\max(a, b)$ and $a \wedge b$ for $\min(a, b)$, for $a, b \in \mathbb{R}$.

Specifically for $M = 2$: we use the shorthand $p_m(x) := R_m(r = 1 | x)$ for $m = 1, 2$. Let the expected averaged reward be $p_+(x) := \frac{1}{2}(p_1(x) + p_2(x))$, the difference in rewards $p_-(x) := \frac{1}{2}(p_1(x) - p_2(x))$, and $\Delta(x) := |p_-(x)|$.
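To make the interaction protocol of Definition 1 concrete, here is a minimal simulation sketch; all parameter values and names are illustrative assumptions of ours, not taken from the paper. The key point is that the latent reward model $m$ is drawn once per episode, affects every reward in that episode, and is never observed by the agent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2RM-MDP (all quantities below are made-up illustration values).
S, A, H = 3, 2, 5
nu = np.array([1.0, 0.0, 0.0])                 # initial state distribution
T = rng.dirichlet(np.ones(S), size=(S, A))     # common transition kernel: T[s, a] is a distribution over next states
R = rng.uniform(size=(2, S, A))                # R[m, s, a] = R_m(r = 1 | s, a), binary rewards
w = np.array([0.5, 0.5])                       # uniform prior over the two latent contexts

def run_episode(policy):
    """One episode: a latent reward model m is drawn once and never revealed."""
    m = rng.choice(2, p=w)                     # hidden context for the whole episode
    s = rng.choice(S, p=nu)
    history, ret = [], 0.0
    for t in range(H):
        a = policy(history, s)                 # history-dependent policy
        r = rng.binomial(1, R[m, s, a])        # reward drawn from the hidden model R_m
        history.append((s, a, r))
        ret += r
        s = rng.choice(S, p=T[s, a])
    return history, ret

random_policy = lambda hist, s: rng.integers(A)
_, episode_return = run_episode(random_policy)
```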
3 Algorithms

Before developing a learning algorithm for the 2RM-MDP, let us provide intuition behind our approach. The overall procedure is summarized in Algorithm 1.

Algorithm 1 Learning Two Reward-Mixture MDPs
1: Run pure exploration (Algorithm 2) to estimate second-order correlations
2: Estimate 2RM-MDP parameters $\hat{\mathcal{M}}$ from the collected data (Algorithm 3)
3: Return $\hat{\pi}$, the (approximately) optimal policy of $\hat{\mathcal{M}}$

At a high level, our approach rests on the following observation: the latent reward model of a 2RM-MDP can be recovered from reward correlations and the averaged reward. Consider a simpler interaction model, in which we have perfect and exact access to correlations of the reward function. That is, suppose we can query, for any $(x_i, x_j) \in (\mathcal{S} \times \mathcal{A})^{\otimes 2}$, the reward correlation function
$$\mu(x_i, x_j) := \mathbb{E}[r_i \cdot r_j \,|\, x_i, x_j] = \tfrac{1}{2}\big(p_1(x_i)\, p_1(x_j) + p_2(x_i)\, p_2(x_j)\big). \tag{1}$$
Assume we can also query the exact expected average reward $p_+(x)$ for all $x \in \mathcal{S} \times \mathcal{A}$. Suppose we construct a matrix $B \in \mathbb{R}^{SA \times SA}$, indexed by state-actions $x \in \mathcal{S} \times \mathcal{A}$, whose $(i, j)$ entry is given by $B_{i,j} = \mu(x_i, x_j) - p_+(x_i)\, p_+(x_j)$. Simple algebra shows that $B = q q^{\top}$, where $q \in \mathbb{R}^{SA}$ is a vector indexed by $x \in \mathcal{S} \times \mathcal{A}$ such that $q_i = p_-(x_i)$. Hence, we can compute the top principal component of $B$, from which we can recover $p_-(x)$ for all $x \in \mathcal{S} \times \mathcal{A}$. Through access to $p_-(x)$ (ignoring the sign-ambiguity issue for now) and $p_+(x)$, we can recover the probabilities conditioned on the latent contexts via
$$R_1(r = 1 | x) = p_+(x) + p_-(x) \quad \text{and} \quad R_2(r = 1 | x) = p_+(x) - p_-(x). \tag{2}$$
Once we have the latent reward model, we can use any approximate planning algorithm (e.g., point-based value iteration) to find an $\epsilon$-optimal policy. A minimal numerical sketch of this exact-oracle recovery is given after the overview below.

However, when such an exact oracle is not supplied, and the interaction with the 2RM-MDP is based on trajectories and roll-in policies, getting a point-wise good estimate of $B$ is not generally possible. For example, some state-action pairs cannot be visited in the same episode, in which case we cannot get any samples for the correlations of such pairs of state-actions: $\mu(x_i, x_j)$ is completely unknown. In such a case, recovery of the true model parameters is not possible in general, even if we estimate all reachable pairs without errors. Hence the main challenge is as follows: can we estimate a model from a set of empirical second-order correlations such that the estimated model is close to the true model in terms of the optimal policy, even if it is not close in model parameters?

For usual MDPs, states that cannot be reached pose no problem: we do not need to learn transition or reward probabilities of these states, since no optimal policy will use them. Indeed, with this observation at hand, reward-free exploration techniques [31, 33] can be utilized to learn a good model w.r.t. all possible reward functions (given a fixed initial distribution). Our main conceptual contribution is to show how these techniques can be leveraged for the RM-MDP problem.

Overview of our approach. Our approach develops the following ideas.

1. Section 3.1: We define what we call a second-order MDP, an augmented MDP whose states are pairs of states of the original MDP. Borrowing technology from single MDPs, we show that reward-free exploration on the augmented MDP can be used to approximate second-order moments of the original MDP, along with confidence intervals determined by how often a particular pair of states can be reached.

2. Section 3.2: We show how to find the parameters of a model whose second-order moments lie in the confidence set obtained from the augmented MDP. To do this, we show that we can use linear programming to select an element of the uncertainty set that corresponds to valid second-order moments for a model.

3. Section 3.3: The main contribution of the analysis is then to show that for any model whose true second-order statistics lie in the set so obtained, an approximate planning oracle is guaranteed to return an $\epsilon$-optimal solution to the true model.
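The following is the numerical sketch of the exact-oracle recovery referenced above, with synthetic ground-truth parameters of our choosing: build $B$, take its top principal component, read off $p_-$ up to a global sign, and plug into (2).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                                   # number of state-action pairs x_1, ..., x_n (toy)
p1 = rng.uniform(size=n)                # R_1(r=1 | x)
p2 = rng.uniform(size=n)                # R_2(r=1 | x)

p_plus = 0.5 * (p1 + p2)
mu = 0.5 * (np.outer(p1, p1) + np.outer(p2, p2))    # exact correlation oracle mu(x_i, x_j)

B = mu - np.outer(p_plus, p_plus)       # B = q q^T with q_i = p_-(x_i)
eigval, eigvec = np.linalg.eigh(B)
q_hat = np.sqrt(max(eigval[-1], 0.0)) * eigvec[:, -1]   # top principal component

# q_hat recovers p_- up to a global sign; both signs give the same 2RM-MDP model when w = 1/2.
p_minus = 0.5 * (p1 - p2)
assert np.allclose(np.abs(q_hat), np.abs(p_minus), atol=1e-8)

R1_hat, R2_hat = p_plus + q_hat, p_plus - q_hat         # plug-in recovery as in (2)
```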
Theorem 3.2 shows that the sample complexity depends quadratically on $SA$. This is well expected, given that our algorithm must compute statistics on the augmented graph. We next give a matching lower bound: Theorem 3.3 shows that this quadratic dependence is unavoidable.

3.1 Pure Exploration of Second-Order Correlations

Our first goal is to collect samples for pairs of state-actions through exploration. This will ultimately allow us to estimate the correlations of the reward function, as well as the average reward per state-action, $p_+(x)$. The backbone of pure exploration is the adaptive reward-free exploration scheme studied in standard MDP settings [33]. We first consider a surrogate model that helps to formulate the data collection process for correlations of pairs. Specifically, we define the augmented MDP:

Definition 3 (Augmented Second-Order MDP) An augmented second-order MDP $\tilde{\mathcal{M}}$ is defined on a state space $\tilde{\mathcal{S}}$ and action space $\tilde{\mathcal{A}}$, where
$$\tilde{\mathcal{S}} = \big\{ (i, v, s) \,\big|\, i \in \{1,2,3\},\ v = (v^1, v^2) \in \big((\mathcal{S}\times\mathcal{A}) \cup \{\text{null}\}\big)^{\otimes 2},\ s \in \mathcal{S} \big\}, \qquad \tilde{\mathcal{A}} = \big\{ (a, z) \,\big|\, a \in \mathcal{A},\ z \in \{0,1\} \big\}.$$
Then in $\tilde{\mathcal{M}}$, an augmented state $(i_t, v_t, s_t)$ evolves under an action $(a_t, z_t)$ as follows: $i_1 = 1$, $v_1 = (\text{null}, \text{null})$, $s_1 \sim \nu(\cdot)$, $s_{t+1} \sim T(\cdot | s_t, a_t)$, $v^{j}_{t+1} = v^{j}_{t}$ for $j \in [2] \setminus \{i_t\}$, and
$$i_{t+1} = \begin{cases} i_t & \text{if } z_t = 0 \text{ or } i_t = 3, \\ i_t + 1 & \text{otherwise}, \end{cases} \qquad v^{i_t}_{t+1} = \begin{cases} v^{i_t}_{t} & \text{if } z_t = 0 \text{ or } i_t = 3, \\ (s_t, a_t) & \text{otherwise}. \end{cases} \tag{3}$$

The second-order augmented MDP consists of an extended state space where the first two coordinates $i, v$ store previously selected state-actions, followed by the coordinate $s$ for the current state. The action space is a product of original actions $a$ and choice actions $z$, where $z = 1$ means that we select the current state-action to be explored as part of a pair in the episode. Specifically, when $z = 1$, by design we increase the count $i$. If $i$ reaches 3, it means we have collected a sample of a pair $(v^1, v^2)$ in the episode.

Initially, the true model parameters are unknown, and thus we do not know how to collect samples for any given pair of state-actions, nor how many samples we need for each pair. To resolve this, we use ideas from reward-free exploration for MDPs: we consider the upper-confidence error-bound function $\tilde{Q}$ that follows from the Bellman equation for $\tilde{\mathcal{M}}$ with a (pure) exploration bonus:
$$b_r(i, v, z) = \mathbb{1}\{i = 2 \wedge z = 1\} \cdot \Big(1 \wedge \sqrt{\tfrac{\iota_2}{n(v')}}\Big), \qquad b_T(s, a) = 1 \wedge \sqrt{\tfrac{\iota_T}{n(s, a)}},$$
$$\tilde{Q}_t\big((i, v, s), (a, z)\big) = 1 \wedge \Big( b_r(i, v, z) + \mathbb{E}_{s' \sim \hat{T}(\cdot|s,a)}\big[ \tilde{V}_{t+1}(i', v', s') \big] + b_T(s, a) \Big), \tag{4}$$
where $\mathbb{1}\{i = 2 \wedge z = 1\}$ indicates whether we collect a sample for the correlation between the state-actions in $v'$. Here, $i'$ and $v'$ are the first and second coordinates of the next state following the transition rule (3), $\iota_2 = O(\log(K/\eta))$ and $\iota_T = O(S \log(K/\eta))$ are properly set confidence interval parameters, $K$ is the total number of episodes to be explored, and $\tilde{Q}_{H+1}(\cdot) = 0$.

Algorithm 2 Pure Exploration of Second-Order Correlations
1: Initialize $\tilde{Q}(\cdot)(\cdot) = \tilde{V}_0 = 1$, $n(v) = 0$ for all $v \in (\mathcal{S}\times\mathcal{A})^{\otimes 2}$.
2: while $\tilde{V}_0 > \epsilon_{pe}$ do
3:   Get an initial state $s_1$ for the $k^{th}$ episode. Let $v_1 = (\text{null}, \text{null})$, $i = 1$, $r_c = 1$
4:   for $t = 1, 2, \ldots, H$ do
5:     Pick $(a_t, z_t) = \arg\max_{(a, z) \in \tilde{\mathcal{A}}} \tilde{Q}_t((i_t, v_t, s_t), (a, z))$.
6:     Play action $a_t$, observe next state $s_{t+1}$ and reward $r_t$.
7:     if $i_t \le 2$ and $z_t = 1$ then
8:       $r_c \leftarrow r_c \cdot r_t$
9:     end if
10:    Update $(i_{t+1}, v_{t+1}, s_{t+1})$ according to the choice of $a_t$ and $z_t$ following the rule in (3)
11:   end for
12:   if $i_{H+1} = 3$ then
13:     $n(v_{H+1}) \leftarrow n(v_{H+1}) + 1$, $\quad \hat{\mu}(v_{H+1}) \leftarrow (1 - 1/n(v_{H+1}))\, \hat{\mu}(v_{H+1}) + r_c / n(v_{H+1})$
14:   end if
15:   Update $\hat{\nu}, \hat{T}, \hat{p}_+$ from the trajectory $(s, a, r)_{1:H}$, and then update $\tilde{Q}$ and $\tilde{V}$ using (4), (5)
16: end while
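Below is a minimal sketch of the bookkeeping in the transition rule (3); the function and variable names are ours. The sampling of $s_{t+1} \sim T(\cdot | s_t, a_t)$ is assumed to happen outside this helper.

```python
# A minimal sketch of the augmented-state bookkeeping in (3); variable names are ours.
NULL = None

def init_augmented_state(s1):
    # i counts how many slots of the pair have been filled; v holds the chosen state-actions.
    return (1, (NULL, NULL), s1)

def step_augmented(aug_state, a, z, s_next):
    """Apply the deterministic part of rule (3); s_next is drawn from T(.|s, a) elsewhere."""
    i, (v1, v2), s = aug_state
    if z == 1 and i < 3:
        # Select the current state-action (s, a) as the i-th element of the pair.
        if i == 1:
            v1 = (s, a)
        else:
            v2 = (s, a)
        i += 1
    # If z == 0 or i == 3, the counter and the stored pair are left unchanged.
    return (i, (v1, v2), s_next)

# Example: selecting the state-actions at steps 1 and 3 completes the pair (v1, v2).
aug = init_augmented_state(s1=0)
aug = step_augmented(aug, a=1, z=1, s_next=2)   # i: 1 -> 2, v1 = (0, 1)
aug = step_augmented(aug, a=0, z=0, s_next=1)   # bookkeeping unchanged
aug = step_augmented(aug, a=1, z=1, s_next=0)   # i: 2 -> 3, v2 = (1, 1)
assert aug[0] == 3 and aug[1] == ((0, 1), (1, 1))
```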
At every episode, we choose a greedy action with respect to $\tilde{Q}$ and collect sufficient data for pairs of state-actions by pure exploration on the augmented MDP. We repeat the pure-exploration process until $\tilde{V}_0$ becomes less than the threshold $\epsilon_{pe}$, where
$$\tilde{V}_t(i, v, s) = \max_{(a', z') \in \tilde{\mathcal{A}}} \tilde{Q}_t\big((i, v, s), (a', z')\big), \qquad \tilde{V}_0 = \sqrt{\tfrac{\iota_\nu}{k}} + \sum_{s} \hat{\nu}(s)\, \tilde{V}_1(1, v_1, s), \tag{5}$$
and $\iota_\nu = O(S \log(K/\eta))$ is a confidence interval parameter for initial states. The pure-exploration procedure is summarized in Algorithm 2. The main purpose of Algorithm 2 is to balance the number of samples collected for the different correlations.

3.2 Recovery of Empirical Model with Uncertainty

Once we gather enough samples to estimate correlations of pairs of state-actions, we are assured that $\tilde{V}_0 \le \epsilon_{pe}$. Given this, we describe an efficient algorithm to recover a good estimate of the true 2RM-MDP parameters. We formulate the model recovery problem as a linear program (LP). Recall that for any $(x_i, x_j) \in (\mathcal{S}\times\mathcal{A})^{\otimes 2}$, we can compute the product of the differences at the two positions $x_i, x_j$:
$$u(x_i, x_j) := p_-(x_i)\, p_-(x_j) = \mu(x_i, x_j) - p_+(x_i)\, p_+(x_j). \tag{6}$$
Ideally, with exact oracles for the values of $u(x_i, x_j)$, we could recover $p_-(x)$ through an LP: let $l(x) := \log |p_-(x)|$ for all $x \in \mathcal{S}\times\mathcal{A}$. Then we first find a solution to the following LP:
$$l(x_i) + l(x_j) = \log \big| \mu(x_i, x_j) - p_+(x_i)\, p_+(x_j) \big|, \qquad \forall x_i, x_j \in \mathcal{S}\times\mathcal{A}.$$
After solving for $l(x)$, we can find consistent assignments of signs $\mathrm{sign}(x)$ such that $p_-(x) = \mathrm{sign}(x) \cdot \exp(l(x))$, which can be formulated as a 2-Satisfiability problem [2].

However, after the pure-exploration phase, we only have estimates of the correlations $\hat{\mu}(x_i, x_j)$ and confidence intervals $b(x_i, x_j) := \sqrt{\iota_2 / n(x_i, x_j)} + \epsilon_0^2$, where $\epsilon_0 = \epsilon / H^2$ is a small tolerance parameter (as well as estimates of the expected average reward $\hat{p}_+(x)$ and its confidence interval $b(x) := \sqrt{\iota_1 / n(x)}$). Note that $n(x_i, x_j)$ is the number of times the pair $(x_i, x_j)$ was counted during the pure-exploration phase. Let $\hat{u}(x_i, x_j) := \hat{p}_-(x_i)\, \hat{p}_-(x_j)$ be an empirical estimate of $u(x_i, x_j)$. We want to recover $\hat{p}_-(x) = \mathrm{sign}(x) \cdot \exp(\hat{l}(x))$ such that the following is satisfied:
$$\Big| \big(\hat{\mu}(x_i, x_j) - \hat{p}_+(x_i)\, \hat{p}_+(x_j)\big) - \hat{u}(x_i, x_j) \Big| \le b(x_i, x_j), \qquad \forall x_i, x_j \in \mathcal{S}\times\mathcal{A}. \tag{7}$$
We present in Appendix B.1 the LP formulation to recover $\hat{l}(x)$ and $\mathrm{sign}(x)$. The outline of the procedure is stated in Algorithm 3, and more details are presented in Algorithm 4.

Algorithm 3 Reward Model Recovery from Second-Order Correlations
1: Solve an LP for $\hat{l}(x)$ and find $\mathrm{sign}(x)$ for all $x \in \mathcal{S}\times\mathcal{A}$ that satisfy (7).
2: Clip $\hat{l}(x)$ to $(-\infty, \hat{u}(x)]$, where $\hat{u}(x) = \log\big(\min(\hat{p}_+(x), 1 - \hat{p}_+(x))\big)$, for all $x \in \mathcal{S}\times\mathcal{A}$.
3: Set $\hat{p}_-(x) = \mathrm{sign}(x) \cdot \exp(\hat{l}(x))$ for all $x \in \mathcal{S}\times\mathcal{A}$.
4: Let $\hat{p}_1(x) = \hat{p}_+(x) + \hat{p}_-(x)$, $\hat{p}_2(x) = \hat{p}_+(x) - \hat{p}_-(x)$ for all $x \in \mathcal{S}\times\mathcal{A}$.
5: Return $\hat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{T}, \hat{\nu}, \{\hat{R}_m\}_{m=1}^{2})$, an empirical 2RM-MDP model.

In order for Algorithm 3 to succeed, we need the following lemma on the existence of feasible solutions to the LP with high probability:

Lemma 3.1 Suppose that we set the confidence parameters as $\iota_1 = \iota_2 = O(\log(SA/\eta))$ and $\iota_\nu = \iota_T = O(S \log(SA/\eta))$. Then with probability at least $1 - \eta$, Algorithm 3 returns a model $\hat{\mathcal{M}}$ that satisfies the constraints (7) for all $x_i, x_j \in \mathcal{S}\times\mathcal{A}$.

Given Lemma 3.1, Algorithm 3 finds at least one feasible solution, i.e., a model $\hat{\mathcal{M}}$, in polynomial time. Via such a solution, we obtain $\hat{p}_- \in \mathbb{R}^{SA}$ and can recover an estimate of the latent model, $\hat{p}_1(s, a) = \hat{R}_1(r = 1 | s, a)$ and $\hat{p}_2(s, a) = \hat{R}_2(r = 1 | s, a)$ for all state-action pairs, by the plug-in estimate (2). We refer to such a model as the empirical model. The proof of Lemma 3.1 is given in Appendix B.2.
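The following sketch mirrors the recovery procedure at a high level, under simplifying assumptions of our own: it keeps only pairs whose estimated product exceeds its confidence width, imposes interval constraints on $l(x_i) + l(x_j)$ as a feasibility LP, and propagates signs by graph coloring. It is not the exact LP of Appendix B.1, and it replaces the clipping step with the crude bound $l(x) \le \log(1/2)$.

```python
import numpy as np
from scipy.optimize import linprog
from collections import deque

def recover_p_minus(mu_hat, p_plus_hat, b, n):
    """Sketch of Algorithm 3: recover l(x) = log|p_-(x)| and signs from noisy pair statistics.

    mu_hat, b: dicts keyed by pairs (i, j), holding empirical correlations and confidence widths.
    Feasibility is assumed (as guaranteed with high probability by Lemma 3.1).
    """
    pairs, A_ub, b_ub = [], [], []
    for (i, j), m in mu_hat.items():
        m_ij = m - p_plus_hat[i] * p_plus_hat[j]          # estimate of u(x_i, x_j) = p_-(x_i) p_-(x_j)
        if abs(m_ij) <= b[(i, j)]:
            continue                                       # uninformative pair: skipped in this sketch
        pairs.append((i, j, np.sign(m_ij)))
        row = np.zeros(n); row[i] += 1; row[j] += 1
        A_ub.append(row);  b_ub.append(np.log(abs(m_ij) + b[(i, j)]))    #   l_i + l_j <= log upper
        A_ub.append(-row); b_ub.append(-np.log(abs(m_ij) - b[(i, j)]))   # -(l_i + l_j) <= -log lower
    if not pairs:
        return np.zeros(n)
    res = linprog(c=np.zeros(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, np.log(0.5))] * n, method="highs")
    l = res.x

    # Sign assignment: sign(x_i) * sign(x_j) must match sign(m_ij); propagate by graph coloring.
    sign = np.zeros(n)
    adj = {i: [] for i in range(n)}
    for i, j, s in pairs:
        adj[i].append((j, s)); adj[j].append((i, s))
    for root in range(n):
        if sign[root] == 0:
            sign[root] = 1.0
            queue = deque([root])
            while queue:
                u_node = queue.popleft()
                for v_node, s in adj[u_node]:
                    if sign[v_node] == 0:
                        sign[v_node] = sign[u_node] * s
                        queue.append(v_node)
    return sign * np.exp(l)
```

The consistency check that such a sign assignment exists is exactly the 2-Satisfiability reformulation mentioned above; the coloring here simply assumes the constraints are consistent.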
3.3 Main Theoretical Results

Upper Bound. We first prove the correctness of our main algorithm (Algorithm 1) and compute an upper bound on its sample complexity. Our upper bound has a quadratic dependence on $S \cdot A$. This is well expected, because we must estimate quantities related to the augmented second-order MDP. Our lower bound shows that this dependence is in fact tight. Additionally, our bounds exhibit an interesting dependence on the error threshold $\epsilon$ that in fact depends on the minimum separation over distinguishable state-actions:
$$\delta = 1 \wedge \min_{x \in \mathcal{S}\times\mathcal{A}:\, \Delta(x) > 0} \Delta(x), \tag{8}$$
where $\Delta(x) := |p_-(x)|$. This is reminiscent of similar results for mixture models (e.g., [10, 37]). We state the following theorem on the sample complexity of 2RM-MDPs.

Theorem 3.2 There exists a universal constant $C > 0$ such that if we run Algorithm 1 with $\epsilon_{pe} = \frac{\epsilon}{H^3} \cdot \big(\delta \vee \frac{\epsilon}{H^2}\big)$, then Algorithm 2 terminates after at most $K$ episodes, where
$$K = C \cdot \frac{H^6 S^2 A (H + A)}{\epsilon^2 \big(\frac{\epsilon}{H^2} \vee \delta\big)^{2}}\, \log\big(HSA/(\epsilon\eta)\big)\, \log^2(H/\epsilon),$$
with probability at least $1 - \eta$. Furthermore, under the same high-probability event, Algorithm 1 returns an $\epsilon$-optimal policy.

Theorem 3.2 shows that we can find an $\epsilon$-optimal policy for any 2RM-MDP instance with at most $O(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes of exploration. We leave it as future work to investigate whether the dependency on $\epsilon$ can be uniformly $O(1/\epsilon^2)$ regardless of $\delta$, as well as whether the dependency on $H$ can be tightened.

Lower Bound. The sample complexity guaranteed for Algorithm 1 by Theorem 3.2 scales with $S^2 A^2$. We show that this dependency on $S^2 A^2$ is also necessary for two reward-mixing MDPs:

Theorem 3.3 There exists a class of 2RM-MDPs such that in order to obtain an $\epsilon$-optimal policy, we need at least $\Omega(S^2 A^2 / \epsilon^2)$ episodes.

The proof of the lower bound can be found in Appendix C.

Computational Complexity. The main bottlenecks of Algorithm 1 are two-fold: (1) solving the LP in Algorithm 3 and (2) computing the optimal policy for $\hat{\mathcal{M}}$. For (1), with $SA$ variables and $O(S^2 A^2)$ constraints, solving the LP in Algorithm 3 takes $\mathrm{poly}(S, A)$ time (see [11] and the references therein for some known computational complexity results for solving LPs). For (2), the point-based value-iteration algorithm [42] runs in $O(\epsilon^{-2} H^5 S A)$ time to obtain an $\epsilon$-optimal policy for any specified 2RM-MDP model. Therefore, the overall running time of Algorithm 1 is polynomial in all parameters.

4 Analysis Overview

For ease of presentation, we assume in this section that the true transition and initial state probabilities are known in advance. In Appendix B, we complete the proof of Theorem 3.2 without assuming known transition and initial probabilities, instead using the estimates $\hat{\nu}$ and $\hat{T}$ obtained during the pure-exploration phase.

We first note that given any policy $\pi \in \Pi$, the difference in expected rewards can be bounded by the $\ell_1$ statistical distance between trajectory distributions. More specifically, consider the set of all possible trajectories $\mathcal{T} = (\mathcal{S}\times\mathcal{A}\times\{0,1\})^{\otimes H}$, that is, all state-action-reward sequences of length $H$. Then,
$$\big| V^{\pi}_{\mathcal{M}} - V^{\pi}_{\hat{\mathcal{M}}} \big| \le H \cdot \big\| (\mathbb{P}^{\pi} - \hat{\mathbb{P}}^{\pi})\big((x, r)_{1:H}\big) \big\|_1 = H \sum_{\tau \in \mathcal{T}} \big| \mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau) \big|,$$
where $\hat{\mathbb{P}}^{\pi}(\cdot)$ is the probability of an event measured in the empirical model $\hat{\mathcal{M}}$ with policy $\pi$. Therefore, for any policy $\pi \in \Pi$, our goal is to show that $\sum_{\tau \in \mathcal{T}} |\mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau)| \le O(\epsilon/H)$, i.e., that the true and empirical models are close in $\ell_1$ statistical distance over trajectory distributions. The main challenge in the analysis is to bound this $\ell_1$ distance when the estimated first- and second-order moments are only known up to their confidence intervals.
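To see the first inequality above, write $R(\tau) = \sum_{t=1}^{H} r_t$ for the cumulative reward along a trajectory $\tau$ (a symbol we introduce here only for this step), so that $R(\tau) \le H$ and
$$\big| V^{\pi}_{\mathcal{M}} - V^{\pi}_{\hat{\mathcal{M}}} \big|
= \Big| \sum_{\tau \in \mathcal{T}} \big(\mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau)\big)\, R(\tau) \Big|
\le \Big(\max_{\tau \in \mathcal{T}} R(\tau)\Big) \sum_{\tau \in \mathcal{T}} \big| \mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau) \big|
\le H \sum_{\tau \in \mathcal{T}} \big| \mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau) \big|.$$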
We now show that when all pairs of state-actions in $\tau$ were visited a sufficient number of times, we can bound the $\ell_1$ difference. Let us first divide the set of trajectories into several groups, depending on how many times every pair in each trajectory has been explored during the pure-exploration phase. We first define the following sets:
$$\mathcal{X}_l = \big\{ (x_i, x_j) \in (\mathcal{S}\times\mathcal{A})^{\otimes 2} \,\big|\, n(x_i, x_j) \ge n_l \big\}, \qquad \mathcal{E}_l = \big\{ x_{1:H} \in \mathcal{T} \,\big|\, \forall t_1, t_2 \in [H],\ t_1 \neq t_2:\ (x_{t_1}, x_{t_2}) \in \mathcal{X}_l \big\}, \tag{9}$$
where $n_l = C\,\iota_2\,\delta_l^{-2}\epsilon_l^{-2}$ for some absolute constant $C > 0$, $\epsilon_0 = \epsilon/H^2$, and $\epsilon_l = 2\epsilon_{l-1}$ for $l \ge 1$. Here $\delta_l = \max(\delta, \epsilon_l)$ is the threshold for distinguishable state-actions $x$, i.e., those with $\Delta(x) \ge \delta_l$. In a nutshell, $\mathcal{E}_l$ is the set of sequences of state-actions in which every pair of state-actions has been explored at least $n_l$ times. We then split the set of trajectories into disjoint sets $\mathcal{E}'_0 = \mathcal{E}_0$ and $\mathcal{E}'_l = \mathcal{E}^{c}_{l-1} \cap \mathcal{E}_l$ for $l \in [L+1]$, i.e., sets of trajectories in which all pairs are visited at least $n_l$ times and at least one pair is visited fewer than $n_{l-1}$ times, where we let $L = O(\log(H/\epsilon))$ be the largest integer such that $n_L \ge C\,\iota_2$. Let $n_{L+1} = 0$ and $\epsilon_{L+1} = 1$.

Recall that our goal is to control the $\ell_1$ statistical distance between the distributions of trajectories, under any history-dependent policy, induced by the true and empirical models. With the above machinery, we can instead bound the $\ell_1$ statistical distance over all trajectories as follows:
$$\big\| (\mathbb{P}^{\pi} - \hat{\mathbb{P}}^{\pi})(\tau) \big\|_1 = \sum_{l=0}^{L+1}\ \sum_{\tau:\, x_{1:H} \in \mathcal{E}'_l} \big| \mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau) \big| \le \sum_{l=0}^{L+1}\ \sup_{\pi \in \Pi} \mathbb{P}^{\pi}\big(x_{1:H} \in \mathcal{E}'_l\big) \cdot O(H\epsilon_l), \tag{10}$$
where $\mathbb{P}^{\pi}(x_{1:H} \in \mathcal{E}'_l)$ is the probability that the random trajectory $\tau$ observed when following a policy $\pi$ is contained in the set $\mathcal{E}'_l$. The following lemma is critical in bounding the above inner sum for each fixed $l$:

Lemma 4.1 (Eventwise Total Variance Discrepancy) For any $l \in \{0, 1, \ldots, L+1\}$ and any history-dependent policy $\pi \in \Pi$, we have
$$\sum_{\tau:\, x_{1:H} \in \mathcal{E}'_l} \big| \mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau) \big| \le \sup_{\pi \in \Pi} \mathbb{P}^{\pi}\big(x_{1:H} \in \mathcal{E}'_l\big) \cdot O(H\epsilon_l). \tag{11}$$

The second key connection is the relation between $\sup_{\pi \in \Pi} \mathbb{P}^{\pi}(\mathcal{E}'_l)$ and the data collected in the pure-exploration phase. The following lemma connects them:

Lemma 4.2 (Event Probability Bound by Pure Exploration) With probability at least $1 - \eta$, Algorithm 2 terminates after at most $K$ episodes, where
$$K \le C\, S^2 A (H + A)\, \epsilon_{pe}^{-2} \log\big(HSA/(\epsilon_{pe}\eta)\big), \tag{12}$$
for some absolute constant $C > 0$. Furthermore, for every $l \in [L+1]$ we have
$$\sup_{\pi \in \Pi} \mathbb{P}^{\pi}\big(x_{1:H} \in \mathcal{E}'_l\big) \le O\big(H \epsilon_{pe}\, \delta_l^{-1} \epsilon_l^{-1}\big). \tag{13}$$

Given the above two lemmas, we set $\epsilon_{pe} = \epsilon\delta_0/(H^3 L)$. We then apply Lemma 4.2 and (10) to bound the difference in expected values of the true and empirical models:
$$\big| V^{\pi}_{\mathcal{M}} - V^{\pi}_{\hat{\mathcal{M}}} \big| \le H\, \big\| (\mathbb{P}^{\pi} - \hat{\mathbb{P}}^{\pi})(\tau) \big\|_1 \le O\big(H^3 L\, \epsilon_{pe}\, \delta_0^{-1}\big) \le O(\epsilon).$$
Finally, plugging $\epsilon_{pe} = \epsilon\delta_0/(H^3 L)$ into the required number of episodes (12) gives the sample-complexity bound presented in Theorem 3.2. The full proof of Lemma 4.2 is given in Appendix B.3.

4.1 Key Ideas for Lemma 4.1

The core of the analysis by which we establish Lemma 4.1 goes as follows. We split the state-action pairs into two sets: distinguishable and indistinguishable state-action pairs. Fix an $l \in \{0, 1, \ldots, L+1\}$ and consider the set $\mathcal{E}'_l$. We call a state-action pair $x \in \mathcal{S}\times\mathcal{A}$ $\delta$-distinguishable if $\Delta(x)$ is larger than some $\delta > 0$. Then, by the definition of $\Delta(x)$, this implies that $p_1(x)$ and $p_2(x)$ are sufficiently different. For a trajectory $\tau \in \mathcal{E}'_l$, we say that a state-action $x$ within $\tau$ is $\delta_l$-distinguishable if $\Delta(x) \ge \delta_l$.
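As a small illustration of how the levels in (9) behave, the sketch below tabulates $(\epsilon_l, \delta_l, n_l)$ for toy values of $\epsilon$, $H$, and $\delta$; the constants $C$ and $\iota_2$ are placeholders, since their exact values are set in the analysis.

```python
def level_schedule(eps, H, delta, iota2, C=1.0):
    """Illustrative computation of the thresholds in (9); C and iota2 are placeholder constants."""
    eps_l = eps / H**2                    # epsilon_0
    levels = []
    while True:
        delta_l = max(delta, eps_l)       # cutoff for delta_l-distinguishable state-actions
        n_l = C * iota2 / (delta_l**2 * eps_l**2)
        if n_l < C * iota2:               # L is the largest level with n_L >= C * iota2
            break
        levels.append((eps_l, delta_l, n_l))
        eps_l *= 2                        # epsilon_l = 2 * epsilon_{l-1}
    return levels                         # roughly L = O(log(H / eps)) levels in total

sched = level_schedule(eps=0.1, H=10, delta=0.05, iota2=10.0)
```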
States that are distinguishable have the following property: if the pairwise correlations between three distinguishable state-actions are well estimated, then we can recover the individual reward probability for each of the three state-actions, as in the following lemma:

Lemma 4.3 Suppose that $(x_1, x_2), (x_2, x_3), (x_3, x_1) \in \mathcal{X}_l$ and $\Delta(x_i) \ge \delta_l$ for all $i \in \{1, 2, 3\}$. Let $i^* = \arg\max_{i \in [3]} \Delta(x_i)$ and $i^*_2 = \arg\max_{i \in [3],\, i \neq i^*} \Delta(x_i)$. Then we have, for $i = i^*$, $|\hat{\Delta}(x_i) - \Delta(x_i)| \le \epsilon_l / 2$, and a corresponding bound for $\hat{\Delta}(x_{i^*_2})$ relative to $\Delta(x_{i^*_2})$.

If all reward parameters in a trajectory are relatively well estimated, then we can bound the error between the estimated and true probability of the trajectory. Thus, if a trajectory $\tau \in \mathcal{E}'_l$ contains more than two distinguishable state-actions, we can rely on the closeness in reward parameters to bound $|\mathbb{P}^{\pi}(\tau) - \hat{\mathbb{P}}^{\pi}(\tau)|$. On the other hand, if there are no such three distinguishable state-actions in $\tau$, then we cannot guarantee the correctness of the reward parameters in $\tau$. In that case, we take a different path to show the closeness of the trajectory distributions. The full proof is given in Appendix A.
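One way to see the mechanism behind Lemma 4.3 is the noise-free identity below, a standard observation for mixtures of two product distributions: from the three pairwise products $u(x_i, x_j) = p_-(x_i)\, p_-(x_j)$, each individual magnitude $\Delta(x_i)$ can be read off. The toy values are ours; the lemma itself quantifies how estimation error in these products propagates.

```python
import numpy as np

# Noise-free illustration: from the three pairwise products u(x_i, x_j) = p_-(x_i) p_-(x_j),
# each magnitude Delta(x_i) = |p_-(x_i)| is determined.
p_minus = np.array([0.3, -0.2, 0.25])           # toy values of p_-(x_1), p_-(x_2), p_-(x_3)

u12 = p_minus[0] * p_minus[1]
u23 = p_minus[1] * p_minus[2]
u13 = p_minus[0] * p_minus[2]

delta1 = np.sqrt(abs(u12 * u13 / u23))          # = |p_-(x_1)|
delta2 = np.sqrt(abs(u12 * u23 / u13))          # = |p_-(x_2)|
delta3 = np.sqrt(abs(u13 * u23 / u12))          # = |p_-(x_3)|

assert np.allclose([delta1, delta2, delta3], np.abs(p_minus))
# With only noisy estimates of u, the error in each recovered Delta depends on how large the
# other two Deltas are, which is why Lemma 4.3 orders the three state-actions by Delta.
```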
5 Discussions and Future Work

We developed the first efficient algorithm for RL in two reward-mixing MDPs without any assumptions on system dynamics. We conclude this work with discussions on our assumptions and several future directions.

Non-uniform/Unknown Mixing Weights. When the prior over contexts is not uniform, i.e., $w \neq 1/2$, the parameter recovery procedure (Algorithm 3) may be more complicated due to the sign-ambiguity issue in $p_-(x)$. When $w = 1/2$, the estimated model is identical whether we use $\hat{p}_-(x) \approx p_-(x)$ or $\hat{p}_-(x) \approx -p_-(x)$. However, the model is no longer identical depending on whether we have the correct signs of $p_-(x)$ when $w \neq 1/2$. In Appendix E, we discuss the extension to non-uniform mixing weights in great detail. Once we can solve the problem for any known $w \in [0, 1]$, one possible way to handle an unknown mixing weight is to sweep over all possible mixing weights $w$ from 0 to 1 with discretization level $O(\epsilon/H)$, recover an empirical model by solving the LP assuming known $w$ for each discretized weight, and pick the best policy by testing each policy on the environment.

Structured Reward Distributions. Our approach can be extended to structured reward distributions, e.g., distributions over finite supports. For instance, if a reward can take $l$ different values in $\{q_1, q_2, \ldots, q_l\}$, we redefine $\mu(x_1, x_2) := \frac{1}{2}\sum_{m=1}^{2} p_m(x_1)\, p_m(x_2)^{\top}$ as an $l \times l$ moment matrix, setting $p_m(x) = [R_m(r = q_1 | x), R_m(r = q_2 | x), \ldots, R_m(r = q_l | x)]^{\top}$. Then, we can extend the LP in Algorithm 3 to find $\hat{p}_m(x)$ matching $\mu(x_1, x_2)$ within confidence intervals. This approach would also pay extra $\mathrm{poly}(l)$ factors in both sample and computational complexity. Then, once we can verify that the recovered difference in reward models is close, i.e., $\|\hat{p}_-(x) - p_-(x)\|_1 \le O(\epsilon_l)$, the remaining proof of Theorem 3.2 would follow similarly to the Bernoulli reward case. We leave a more thorough investigation in this direction as future work.

Extensions to More Than Two Reward Models ($M > 2$). As mentioned earlier, an RM-MDP with $M$ reward models is an MDP analogue of learning a mixture of $M$ product distributions over reward distributions. If every higher-order correlation of interest between multiple state-actions can be estimated, we may apply existing methods for learning a mixture of product distributions over a binary field [20, 19, 26]. For instance, Jain and Oh [26] proposed a spectral method which requires all off-diagonal elements of up to third-order moments to learn the parameters of $M$-mixtures of $n$-product distributions. To apply their method, we may estimate correlations of all triples of distinct state-actions, similarly to Algorithm 2; the reward model recovery of the RM-MDP can then be converted to learning a mixture of $n = SA$ product reward distributions. However, as we have already seen in this paper, the real challenge is in recovering the model from correlations of unevenly reachable sets of state-actions, and then analyzing errors with different levels of uncertainty in the correlations. We believe this is a very interesting direction for future research.

In addition to the above-mentioned issues, we believe there are a number of possible future directions. While we specifically focused on RM-MDP problems, the more general problem of Latent MDPs [36] has not been resolved even for $M = 2$ without any assumptions. We believe that it would be an interesting future direction to develop efficient algorithms for Latent MDPs with small $M$. We can also consider more technical questions in RM-MDPs, e.g., what the optimal sample-complexity dependency on $H$ and $\epsilon$ would be, and whether we can develop a fully online algorithm to minimize regret. We leave these as future work.

Acknowledgement

The research was funded by NSF grant 2019844, and by the Army Research Office under Cooperative Agreement Number W911NF-19-2-0333. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19–26, 2009.
[2] B. Aspvall, M. F. Plass, and R. E. Tarjan. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Information Processing Letters, 8(3):121–123, 1979.
[3] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
[4] K. Azizzadenesheli, A. Lazaric, and A. Anandkumar. Reinforcement learning of POMDPs using spectral methods. In Conference on Learning Theory, pages 193–256, 2016.
[5] B. Boots, S. M. Siddiqi, and G. J. Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
[6] E. Brunskill and L. Li. Sample complexity of multi-task reinforcement learning. In Uncertainty in Artificial Intelligence, page 122. Citeseer, 2013.
[7] P. Buchholz and D. Scheftelowitsch. Computation of weighted sums of rewards for concurrent MDPs. Mathematical Methods of Operations Research, 89(1):1–42, 2019.
[8] I. Chadès, J. Carwardine, T. Martin, S. Nicol, R. Sabbadin, and O. Buffet. MOMDPs: a solution for modelling adaptive management problems. In Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), 2012.
[9] S.-O. Chan, I. Diakonikolas, X. Sun, and R. A. Servedio. Learning mixtures of structured distributions over discrete domains. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1380–1394. SIAM, 2013.
[10] Y. Chen, X. Yi, and C. Caramanis. Convex and nonconvex formulations for mixed regression with two components: Minimax optimal rates. IEEE Transactions on Information Theory, 64(3):1738–1766, 2017.
[11] M. B. Cohen, Y. T. Lee, and Z. Song. Solving linear programs in the current matrix multiplication time. Journal of the ACM (JACM), 68(1):1–39, 2021.
[12] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pages 1422–1432, 2018.
[13] S. Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE, 1999.
[14] R. D. De Veaux. Mixtures of linear regressions. Computational Statistics & Data Analysis, 8(3):227–245, 1989.
[15] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
[16] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674, 2019.
[17] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2019.
[18] Y. Efroni, N. Merlis, M. Ghavamzadeh, and S. Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In Advances in Neural Information Processing Systems, pages 12224–12234, 2019.
[19] J. Feldman, R. O'Donnell, and R. A. Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.
[20] Y. Freund and Y. Mansour. Estimating a mixture of two product distributions. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 53–62, 1999.
[21] A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377–399, 2019.
[22] Z. D. Guo, S. Doroudi, and E. Brunskill. A PAC RL algorithm for episodic POMDPs. In Artificial Intelligence and Statistics, pages 510–518, 2016.
[23] A. Hallak, D. Di Castro, and S. Mannor. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.
[24] J. Hoffmann, S. Basu, S. Goel, and C. Caramanis. Learning mixtures of graphs from epidemic cascades. In International Conference on Machine Learning, pages 4342–4352. PMLR, 2020.
[25] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
[26] P. Jain and S. Oh. Learning mixtures of discrete product distributions using spectral decompositions. In Conference on Learning Theory, pages 824–856. PMLR, 2014.
[27] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[28] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
[29] C. Jin, T. Jin, H. Luo, S. Sra, and T. Yu. Learning adversarial MDPs with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
[30] C. Jin, S. M. Kakade, A. Krishnamurthy, and Q. Liu. Sample-efficient reinforcement learning of undercomplete POMDPs. arXiv preprint arXiv:2006.12484, 2020.
[31] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu. Reward-free exploration for reinforcement learning. arXiv preprint arXiv:2002.02794, 2020.
[32] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
[33] E. Kaufmann, P. Ménard, O. D. Domingues, A. Jonsson, E. Leurent, and M. Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pages 865–891. PMLR, 2021.
[34] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232, 2002.
[35] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.
[36] J. Kwon, Y. Efroni, C. Caramanis, and S. Mannor. RL for latent MDPs: Regret guarantees and a lower bound. arXiv preprint arXiv:2102.04939, 2021.
[37] J. Kwon, N. Ho, and C. Caramanis. On the minimax optimality of the EM algorithm for learning two-component mixed linear regression. arXiv preprint arXiv:2006.02601, 2020.
[38] O.-A. Maillard and S. Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144, 2014.
[39] G. Neu, A. Gyorgy, and C. Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813. PMLR, 2012.
[40] I. Osband and B. Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466–1474, 2014.
[41] I. Osband and B. Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710. PMLR, 2017.
[42] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.
[43] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 2013.
[44] M. Simchowitz and K. G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, pages 1153–1162, 2019.
[45] L. N. Steimle, D. L. Kaufman, and B. T. Denton. Multi-model Markov decision processes. Optimization Online, URL http://www.optimization-online.org/DB_FILE/2018/01/6434.pdf, 2018.
[46] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[47] A. Tossou, D. Basu, and C. Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. arXiv preprint arXiv:1905.12425, 2019.
[48] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[49] J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
[50] A. Zanette and E. Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.