# Tractable Optimality in Episodic Latent MABs

Jeongyeol Kwon (University of Wisconsin-Madison, jeongyeol.kwon@wisc.edu) · Yonathan Efroni (Meta, New York, jonathan.efroni@gmail.com) · Constantine Caramanis (The University of Texas at Austin, constantine@utexas.edu) · Shie Mannor (Technion, NVIDIA, shie@ee.technion.ac.il, smannor@nvidia.com)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** We consider a multi-armed bandit problem with $A$ actions and $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps. Depending on the length of the episode, the learner may not be able to estimate the latent context accurately. The resulting partial observation of the environment makes the learning task significantly more challenging. Without any additional structural assumptions, existing techniques to tackle partially observed settings imply the decision maker can learn a near-optimal policy with $O(A)^H$ episodes, but do not promise more. In this work, we show that learning with polynomial samples in $A$ is possible. We achieve this by using techniques from experiment design. Then, through a method-of-moments approach, we design a procedure that provably learns a near-optimal policy with $O(\mathrm{poly}(A) + \mathrm{poly}(M, H)^{\min(M,H)})$ interactions. In practice, we show that we can formulate the moment matching via maximum likelihood estimation. In our experiments, this significantly outperforms the worst-case guarantees, as well as existing practical methods.

## 1 Introduction

In Multi-Armed Bandits (MABs), an agent learns to act optimally by interacting with an unknown environment. In many applications, interaction sessions are short relative to the complexity of the overall task. As a motivating example, we consider a large content website that interacts with users arriving to the site by making sequential recommendations. In each such episode, the system has only a few chances to recommend items before the user leaves the website, an interaction time typically much smaller than the number of actions or the number of types of users. In such settings, agents often have access to short-horizon episodes, and have to learn how to process observations from different episodes to learn the best adaptation strategy.

Motivated by such examples, we consider the problem of learning a near-optimal policy in Latent Multi-Armed Bandits (LMABs). In each episode with time-horizon $H$, the agent interacts with one of $M$ possible MAB environments (e.g., types of user) randomly chosen by nature. Without knowing the identity of the environment (we call this the latent context), an agent aims to maximize the expected cumulative reward per episode (see Definition 2.1 for a formal description). The LMAB framework is different from the setting considered in [31], where multiple episodes proceed in parallel without limiting the horizon $H$ of an episode. For long horizons $H$, we show that it is possible to find a near-optimal policy and to determine near-optimal actions for each episode, as if we knew the hidden context. If $H \ll A$, where $A$ is the total number of actions, however, this is no longer possible. Instead, we aim to learn the best history-dependent policy for $H$ time steps.

### 1.1 Our Results and Contributions

In this work, we study the problem of learning a near-optimal policy of an LMAB for both long and short horizon $H$. In the long-horizon case, we show the problem's latent parameters are learnable.
However, for short horizons the latent parameters may not be identifiable, and a different perspective is required.

A naive approach to learning a near-optimal policy of an LMAB starts by casting the problem as a Markov Decision Process (MDP). Assume that the reward values are discrete and defined over a finite alphabet of cardinality $Z$. By defining the state space as the set of all sequences of history observations, an LMAB can be formulated as an MDP with $O((AZ)^H)$ states. Appealing to standard reinforcement learning algorithms (e.g., [18, 6]), we can learn an $\epsilon$-optimal policy with $O(AZ)^H/\epsilon^2$ samples. A natural way to improve upon the naive approach is to use the unique structure of the LMAB setting. In this work we focus on the following question: can we learn a near-optimal policy with fewer samples than the $O(AZ)^H/\epsilon^2$ of the naive approach?

Our work answers the above question affirmatively. Specifically, our main contributions are as follows. We show that the sample complexity of our algorithm is $\mathrm{poly}(A) + \mathrm{poly}(H, M)^{\min(H,M)}$, and thus the problem is tractable when either $H$ or $M$ is small, even for very large $A$. That is, we are particularly focused on the setting with a few contexts or relatively short episodes (in comparison to $A$), i.e., $M = O(1)$ or $H = O(1)$, where a natural objective is to learn a near-optimal history-dependent policy for $H$ time steps (see also Figure 3 in Appendix B.1).

### 1.2 Related Work

Due to space constraints, we discuss only the most closely related work here, and defer a lengthier discussion to Appendix B.1.

**Learning priors in multi-armed bandit problems.** Several recent works have considered the Bayesian learning framework with short time-horizon [29, 35, 21]. The focus in this line of work is on the design of algorithms that learn the prior while acting with a fixed Bayesian algorithm (e.g., Thompson sampling). While Bayesian learning with short time-horizon may be viewed as a special case of LMABs, the baseline policy we compare ourselves to is the optimal $H$-step policy, which is a harder baseline than a fixed Bayesian algorithm.

**Latent MDPs.** Some prior work considers the framework of Latent MDPs (LMDPs), which is the MDP generalization of LMABs [16, 37, 24, 23]. In particular, [24] has shown the information-theoretic limit of learning in LMDPs with no assumptions on the MDPs, i.e., an exponential number of sample episodes $\Omega(A^M)$ is necessary to learn a near-optimal policy in LMDPs. In contrast, we show that in LMABs the required number of episodes can be polynomial in $A$. This does not contradict the result in [24], since their lower-bound construction comes from the challenge of state exploration with latent contexts. In contrast, there is no state-exploration issue for bandits, which enables the polynomial sample complexity in $A$. Furthermore, our upper bound does not require any assumptions or additional information such as good initialization or separation as in [24]. To the best of our knowledge, no existing results are known for learning a near-optimal policy of LMAB instances for $M \ge 3$ without further assumptions.

**Learning in POMDPs with full-rank observations.** One popular learning approach in partially observable systems is the tensor-decomposition method, which extracts the realization of model parameters from third-order tensors [1, 7, 15]. However, the recovery of model parameters requires specific geometric assumptions on the full-rankness of a certain test-observation matrix.
Furthermore, most prior work requires a uniform reachability assumption, i.e., all latent states should be reached with non-negligible probability by any exploration policy for the parameter recovery. Recent results in [19, 30] have shown that the uniform reachability assumption can be dropped with the optimism principle. However, they still require the full-rankness of a test-observation matrix to keep the volume of a confidence set explicitly bounded. Since LMAB instances do not necessarily satisfy the full-rankness assumption, their results do not imply an upper bound for learning LMABs.

## 2 Preliminaries

We define the problem of episodic latent multi-armed bandits with time-horizon $H \ge 2$ as follows:

**Definition 2.1 (Latent Multi-Armed Bandit (LMAB)).** An LMAB is a tuple $B = (\mathcal{A}, \{w_m\}_{m=1}^M, \{\mu_m\}_{m=1}^M)$, where $\mathcal{A}$ is a set of actions, $\{w_m\}_{m=1}^M$ are the mixing weights such that a latent context $m$ is randomly chosen with probability $w_m$, and $\mu_m$ is the model parameter that describes a reward distribution, i.e., $P_{\mu_m}(r \mid a) := P(r \mid m, a)$ for an action $a \in \mathcal{A}$ conditioned on a latent context $m$ (each $\mu_m$ parameterizes the probability model, and is not necessarily a mean-reward vector).

We do not assume a priori knowledge of the mixing weights. The bulk of this paper considers discrete reward realizations, where the support of the reward distribution is finite and bounded. In Appendix D, we also show that our results can be adapted to Gaussian reward distributions.

**Assumption 2.2 (Discrete Rewards).** The reward distribution has finite and bounded support. The reward attains a value in the set $\mathcal{Z}$. We assume that for all $z \in \mathcal{Z}$ we have $|z| \le 1$. We denote the cardinality of $\mathcal{Z}$ as $Z$ and assume that $Z = O(1)$.

As an example, the Bernoulli distribution satisfies Assumption 2.2 with $\mathcal{Z} = \{0, 1\}$ and $Z = 2$. We denote the probability of observing a reward value $z$ by playing an action $a$ in a context $m$ as $\mu_m(a, z) := P(r = z \mid m, a)$. We often use $\mu_m$ as a reward-probability vector in $\mathbb{R}^{AZ}$ indexed by a tuple $(a, z) \in \mathcal{A} \times \mathcal{Z}$. At the beginning of every episode, a latent context $m \in [M]$ is sampled from the mixing distribution $\{w_m\}_{m=1}^M$ and fixed for $H$ time steps; however, we cannot observe $m$ directly. We consider a policy class $\Pi$ which contains all history-dependent policies $\pi : (\mathcal{A} \times \mathcal{Z})^* \to \mathcal{A}$. Our goal is to find a near-optimal policy $\pi \in \Pi$ that maximizes the expected cumulative reward per episode:
$$V^* = \max_{\pi \in \Pi} V(\pi), \qquad V(\pi) := \mathbb{E}_\pi\Big[\sum_{t=1}^H r_t\Big],$$
where the expectation is taken over latent contexts and rewards generated by an LMAB instance, and actions following a policy $\pi$.

**Definition 2.3 (Approximate Planning Oracle).** A planning oracle receives an LMAB instance $B$ and returns an $\epsilon$-approximate policy $\pi$ such that $V^* - V(\pi) \le \epsilon$. Concretely, the point-based value-iteration (PBVI) algorithm [33] is an $\epsilon$-approximate planning algorithm which runs in time $O(HMAZ \cdot (H^2/\epsilon)^{O(M)})$.

**Additional notation.** For any quantity $q$ with respect to the true LMAB $B$, we use $\hat q$ to refer to its corresponding empirical estimate. We use $w_{\min} := \min_{m \in [M]} w_m$ for the minimum mixing weight. For any $l$-th-order tensor $T_l \in \mathbb{R}^{n \times \cdots \times n}$ ($n$ repeated $l$ times), we write $\|T_l\|_\infty$ for the element-wise largest absolute value. For any vector $v \in \mathbb{R}^n$ in dimension $n \in \mathbb{N}_+$, we use $v^{\otimes l}$ to denote its degree-$l$ tensorization. We often use $\|\mu_m(a, \cdot) - \hat\mu_m(a, \cdot)\|_1$ to mean the $\ell_1$ statistical distance between reward distributions: $\sum_{z \in \mathcal{Z}} |\mu_m(a, z) - \hat\mu_m(a, z)|$. Lastly, we denote by $a_t$ and $r_t$ the action and reward realizations observed at time step $t \in [H]$.
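To make the definitions above concrete, the following is a minimal sketch (our own illustrative code, not the paper's implementation) of an LMAB with Bernoulli rewards satisfying Assumption 2.2, together with a Monte-Carlo estimate of $V(\pi)$ for a simple history-dependent policy. All names (`LMAB`, `mc_value`, `switch_on_failure`) are ours.

```python
import numpy as np

class LMAB:
    """Toy LMAB with Bernoulli rewards (Assumption 2.2 with Z = {0, 1})."""
    def __init__(self, w, p, seed=0):
        self.w = np.asarray(w, dtype=float)       # mixing weights over M latent contexts
        self.p = np.asarray(p, dtype=float)       # p[m, a] = P(r = 1 | context m, action a)
        self.rng = np.random.default_rng(seed)

    def episode(self, policy, H):
        """One episode: sample a latent context, then roll out a history-dependent policy."""
        m = self.rng.choice(len(self.w), p=self.w)   # latent context, fixed for all H steps
        hist, total = [], 0.0
        for _ in range(H):
            a = policy(hist)
            r = float(self.rng.random() < self.p[m, a])
            hist.append((a, r))
            total += r
        return total

def mc_value(bandit, policy, H, n_episodes=20_000):
    """Monte-Carlo estimate of V(pi) = E_pi[sum_{t=1}^H r_t]."""
    return np.mean([bandit.episode(policy, H) for _ in range(n_episodes)])

# Two contexts with opposite best arms, and a simple history-dependent policy:
# try arm 0 first, stick with it if it paid off, and switch otherwise.
bandit = LMAB(w=[0.5, 0.5], p=[[0.9, 0.1], [0.1, 0.9]])
def switch_on_failure(hist):
    if not hist:
        return 0
    a0, r0 = hist[0]
    return a0 if r0 == 1.0 else 1 - a0

print(mc_value(bandit, switch_on_failure, H=5))
```

Even this toy policy illustrates why history-dependence matters for $H > 1$: the optimal behavior within an episode depends on rewards observed at earlier time steps.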
## 3 Sample Complexity of Learning LMABs

In this section, we develop our main algorithm, which learns a near-optimal policy of an LMAB instance with $O(\mathrm{poly}(A) + \mathrm{poly}(M, H)^{\min(M,H)})$ samples. Towards achieving this goal, we first elaborate on a low-rank representation of the LMAB problem. Then, we show how experimental-design techniques can assist in utilizing this structure to improve over the naive upper bound for the problem.

### 3.1 Dimensionality Reduction via Experimental Design

When $H = 1$, estimating the latent context is not possible, and executing a standard MAB strategy (e.g., UCB) is optimal. However, to obtain a near-optimal history-dependent policy with longer time-horizons $H > 1$, tracking the mean reward of an action is not enough, since rewards from previous time steps are correlated with current and future rewards. We consider a model-based learning approach from the $l$-th-order moments of reward observations. Since there are $A$ actions, this implies that we have $O(A^l)$ quantities to estimate, which would incur $O(A^l)$ sample complexity if we separately measured every correlation among $l$-tuples of actions.

On the other hand, when $M \ll A$, the distributions $\mu_m$ (recall these are reward probabilities for each action, in each context) occupy only an $M$-dimensional subspace of the $AZ$-dimensional space: $U := \mathrm{span}\{\mu_1, \mu_2, \ldots, \mu_M\}$. Thus, if we could estimate $U$ and project all observations onto $U$, we could remove the unfavorable dependence on $A$. However, even when the subspace $U$ is known a priori, dimensionality reduction with bandit feedback is non-trivial: we need to compute directly the projection onto $U$ of estimates of the reward-probability vectors. To see the challenge, let $\{\beta_j\}_{j=1}^M \subset \mathbb{R}^{AZ}$ be an orthonormal basis of $U$. To estimate the projection $\mu_m^\top \beta_j$, we need a sampling policy $\pi$ such that $\pi(a) \propto \sum_z |\beta_j(a, z)|$. The variance under this sampling policy can be as large as $O(\|\beta_j\|_1^2)$, which in general scales as $\|\beta_j\|_1^2 = O(A)$. Therefore, reliable estimation of any statistic in the reduced dimension would still have a dependence on $A$. In particular, since our approach is based on a higher-order method of moments, an estimation of $l$-th-order statistics for $l \ge 2$ would require $\Omega(A^{l/2})$ samples.

The general idea of dimensionality reduction is not fatally flawed; instead, we need to avoid the pitfalls of the approach outlined above. First, the calculation above shows that we need to control the 1-norm of the vectors $\beta_j$, ideally $\|\beta_j\|_1 = O(1)$. Second, we only need an estimate of each $\mu_m$ that satisfies $\max_{a \in \mathcal{A}} \|\mu_m(a, \cdot) - \hat\mu_m(a, \cdot)\|_1 \le \epsilon$ in order to compute an $\epsilon$-optimal policy. The key is to show that we can choose the $\beta_j$'s to be a subset of the standard basis in $\mathbb{R}^{AZ}$ (hence they will have $\|\beta_j\|_1 = 1$), in such a way that the approximation quality for $\mu_m$ is guaranteed from estimating the $\mu_m^\top \beta_j$'s. In terms of the original problem, the existence of such a subset of the standard basis is equivalent to the existence of a small set of informative action-value pairs that are sufficiently correlated with all other action-value pairs. This is called a core set in the experimental-design literature, and its existence, in our context, is an important consequence of the Kiefer–Wolfowitz theorem. Specifically, we can select a core set of coordinates with the following crucial lemma:

**Lemma 3.1.** Let $U$ be a given $k$-dimensional linear subspace of $\mathbb{R}^d$ where $d \ge k$, and let $u, u' \in U$. There exists an algorithm that runs in time $O(dk^2)$ and returns the following.

1. A core set of at most $n = 4k \log\log k + 16$ coordinates $\{i_j\}_{j=1}^n \subseteq [d]$ such that $\|u - u'\|_\infty \le \sqrt{2k} \max_{j \in [n]} |u(i_j) - u'(i_j)|$.
2. A linear transformation that maps $[u(i_1), \ldots, u(i_n)]$ to its corresponding $u \in U$.
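To illustrate the kind of computation Lemma 3.1 refers to, the sketch below selects candidate core coordinates for a given subspace basis via approximate D-/G-optimal design (a pivoted-QR initialization followed by classical Frank-Wolfe/Wynn updates). This is a heuristic illustration under our own naming: it is not the algorithm behind Lemma 3.1 and does not certify the $4k\log\log k + 16$ size bound, but it produces the same kind of object, a small set of coordinates whose design has small worst-case leverage.

```python
# Minimal sketch of core-set selection via approximate optimal design (not the paper's code).
import numpy as np
from scipy.linalg import qr

def core_set(B, n_fw_steps=200):
    """B: (d, k) orthonormal basis of the subspace U. Returns candidate core coordinates."""
    d, k = B.shape
    # Initialize the design on k well-conditioned coordinates (pivoted QR on B^T).
    _, _, piv = qr(B.T, pivoting=True)
    support = list(piv[:k])
    pi = np.zeros(d)
    pi[support] = 1.0 / k
    for _ in range(n_fw_steps):
        V = (B * pi[:, None]).T @ B                              # sum_i pi_i x_i x_i^T
        lev = np.einsum('ij,jk,ik->i', B, np.linalg.inv(V), B)   # leverage x_i^T V^{-1} x_i
        i_star, g = int(np.argmax(lev)), float(np.max(lev))
        if g <= 2 * k:                                           # near-optimal design reached
            break
        gamma = (g - k) / (k * (g - 1.0))                        # Wynn/Frank-Wolfe step size
        pi *= (1.0 - gamma)
        pi[i_star] += gamma
    return np.nonzero(pi > 1e-6)[0]                              # coordinates supporting the design

# Example: a random M-dimensional subspace of R^{A*Z}.
rng = np.random.default_rng(0)
A, Z, M = 100, 2, 3
U_basis, _ = np.linalg.qr(rng.standard_normal((A * Z, M)))
print(core_set(U_basis))    # indices of candidate core action-value pairs
```

With the flattening convention used in this sketch, a selected index $i$ corresponds to the action-value pair with $a = \lfloor i/Z \rfloor$ and $z$ the $(i \bmod Z)$-th element of $\mathcal{Z}$.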
Note that in our setting, $d = AZ$ and $k = M$. We prove Lemma 3.1 in Section C.1, essentially as a corollary of the Kiefer–Wolfowitz theorem (and its geometric interpretation) for (near-)optimal experimental design [20, 39]. In the context of bandits and RL, the Kiefer–Wolfowitz theorem has been used to study how misspecification in a linear representation changes the problem landscape (e.g., [27, 32]). However, experimental design has not been previously used for dimensionality reduction in problems with bandit feedback (as far as we know). We believe it is a powerful tool. We review the basics of experimental design in Appendix B.2.

A direct consequence of Lemma 3.1 is an algorithm to find a set of coordinates of size $O(M)$ that are sufficient to reconstruct the latent reward model $\{\mu_m\}_{m=1}^M$, where $\mu_m \in \mathbb{R}^{AZ}$. We refer to this set of coordinates as the set of core action-value pairs.

**Corollary 3.2.** Suppose the subspace $U = \mathrm{span}\{\mu_1, \ldots, \mu_M\}$ is given. Then for any $\hat\mu \in U$, there exists an algorithm that runs in time $O(ZAM^2)$ and returns the following.

1. A set of core action-value pairs $\{(a_j, z_j)\}_{j=1}^n \subseteq \mathcal{A} \times \mathcal{Z}$ of size at most $n = 4M \log\log M + 16$, such that for all $m \in [M]$,
$$\max_{a \in \mathcal{A}} \|\mu_m(a, \cdot) - \hat\mu_m(a, \cdot)\|_1 \le 2Z\sqrt{2M} \max_{j \in [n]} |\nu_m(j) - \hat\nu_m(j)|,$$
where $\nu_m, \hat\nu_m \in \mathbb{R}^n$ are defined by $\nu_m(j) := \mu_m(a_j, z_j)$ and $\hat\nu_m(j) := \hat\mu_m(a_j, z_j)$.
2. A linear transformation that maps $[\hat\nu(i_1), \ldots, \hat\nu(i_n)]$ to its corresponding $\hat\mu \in U$.

Assuming access to the subspace $U$ (we remove this assumption in Section 3.4), Corollary 3.2 implies a strategy for estimating the latent model parameters $\{\mu_m\}_{m=1}^M$: obtain core action-value pairs $\{(a_j, z_j)\}_{j=1}^n$, estimate $\{\nu_m\}_{m=1}^M$, and construct the latent model parameters via a linear transformation of $\{\nu_m\}_{m=1}^M$. In the following sections, we show that by matching higher-order moments we can estimate $\{\nu_m\}_{m=1}^M$ to sufficiently good accuracy with only polynomial dependence on $A$.

### 3.2 $H \ge 2M-1$: Identifiable Regime in Wasserstein Metric

In the regime where $H \ge 2M-1$, we can measure moments up to order $2M-1$. Then, we can leverage recent advances in learning parameters of mixture distributions from higher-order moments [40, 12]. That is, we can recover $\{\nu_m\}_{m=1}^M$ by estimating the first $2M-1$ higher-order moments and finding $\{\hat\nu_m\}_{m=1}^M$ that match them. This further emphasizes the importance of the dimensionality reduction we perform to obtain the set of core action-value pairs; otherwise, a moment-based approach to recovering $\{\mu_m\}_{m=1}^M$ would have an exponential dependence on $A$.

We now elaborate on the estimation procedure of $\{\nu_m\}_{m=1}^M$ from higher-order moments. We define the $l$-th-order tensor $T_l := \sum_{m=1}^M w_m \nu_m^{\otimes l}$. For an LMAB instance, we can access the tensor $T_l$ from observational data by simply estimating correlations between $l$ core action-value pairs within each episode. Specifically, for any $I = (i_1, i_2, \ldots, i_l) \in [n]^l$, the tensor entry $T_l(i_1, i_2, \ldots, i_l)$ is also given by
$$T_l(i_1, i_2, \ldots, i_l) = \mathbb{E}_{\pi_I}\Big[\textstyle\prod_{t=1}^l \mathbf{1}\{r_t = z_{i_t}\}\Big], \tag{1}$$
where $\pi_I$ is the policy that performs the sequence of actions $(a_{i_1}, a_{i_2}, \ldots, a_{i_l})$ for $t = 1, \ldots, l$.
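For concreteness, the following sketch estimates the moment tensor exactly in the form of (1): for each multi-index $I$ it repeatedly plays the fixed action sequence prescribed by $\pi_I$ and averages the product of reward indicators (the empirical estimator used later in Algorithm 1, equation (6), has the same per-entry form). The toy Bernoulli instance and the names (`run_episode`, `estimate_moment_tensor`) are our own illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])                      # mixing weights
p = np.array([[0.9, 0.1], [0.1, 0.9]])        # P(r = 1 | context m, action a)

def run_episode(actions):
    """One fresh episode: sample a latent context, play the given action sequence."""
    m = rng.choice(len(w), p=w)
    return [float(rng.random() < p[m, a]) for a in actions]

def estimate_moment_tensor(core_pairs, l, n_rep=2000):
    """Estimate T_l(i_1,...,i_l) = E_{pi_I}[prod_t 1{r_t = z_{i_t}}] as in equation (1)."""
    n = len(core_pairs)
    T_hat = np.zeros((n,) * l)
    for I in itertools.product(range(n), repeat=l):
        actions = [core_pairs[i][0] for i in I]
        targets = [core_pairs[i][1] for i in I]
        hits = [all(r == z for r, z in zip(run_episode(actions), targets)) for _ in range(n_rep)]
        T_hat[I] = np.mean(hits)
    return T_hat

core_pairs = [(0, 1.0), (1, 1.0)]             # candidate core (action, reward-value) pairs
# Population values for this toy instance: 0.41 on the diagonal, 0.09 off-diagonal.
print(estimate_moment_tensor(core_pairs, l=2))
```

Note that each tensor entry only requires playing a fixed, open-loop action sequence, which is why the moments of the core coordinates can be estimated without any dependence on $A$.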
Suppose we find estimators $\{(\hat w_m, \hat\nu_m)\}_{m=1}^M$ whose moments match $T_l$ up to error $\delta$ for all $l \in [2M-1]$. The results in [40, 12] imply that closeness in moments of degree up to $2M-1$ implies closeness in the model parameters:

**Lemma 3.3.** Suppose $\big\|T_l - \sum_{m=1}^M \hat w_m \hat\nu_m^{\otimes l}\big\|_\infty < \delta$ for all $l = 1, 2, \ldots, 2M-1$, for some sufficiently small $\delta > 0$. Then,
$$\sum_{(m,m') \in [M]^2} \Gamma(m, m')\, \|\nu_m - \hat\nu_{m'}\|_\infty \le O\big(M^3 \sqrt{n}\, \delta^{1/(2M-1)}\big), \tag{2}$$
where $\Gamma$ is a joint distribution over $(m, m') \in [M]^2$ satisfying
$$\Gamma \in \mathbb{R}_+^{M \times M}: \quad \sum_{m'=1}^M \Gamma(m, m') = w_m, \qquad \sum_{m=1}^M \Gamma(m, m') = \hat w_{m'}. \tag{3}$$

The guarantee given for $\{\hat w_m, \hat\nu_m\}_{m=1}^M$ is in the Wasserstein distance between the two latent model parameters. A useful property of the Wasserstein metric is that the distance measure is invariant to permutation of the individual components, and flexible with arbitrarily small mixing probabilities or arbitrarily close components [40] (see also Appendix B.3 for a review of the Wasserstein distance). Once we obtain estimates of $\{\nu_m\}_{m=1}^M$, we can estimate $\{\hat\mu_m\}_{m=1}^M$ as implied by Corollary 3.2 (see Appendix C.8 for further details).
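The coupling $\Gamma$ in (3) is an optimal-transport plan between the mixing weights $\{w_m\}$ and $\{\hat w_m\}$, so the Wasserstein-type model distance appearing in (2) (and again in Proposition 3.4 below) can be evaluated as a small linear program once the per-pair costs are computed. A minimal sketch, with an arbitrary illustrative cost matrix and our own naming:

```python
import numpy as np
from scipy.optimize import linprog

def best_coupling(w, w_hat, cost):
    """Minimize sum_{m,m'} Gamma(m,m') * cost[m,m'] over couplings Gamma satisfying (3):
    row sums equal w, column sums equal w_hat, Gamma >= 0."""
    M = len(w)
    c = cost.reshape(-1)                          # row-major flattening of Gamma
    A_eq = np.zeros((2 * M, M * M))
    for m in range(M):
        A_eq[m, m * M:(m + 1) * M] = 1.0          # sum_{m'} Gamma(m, m') = w_m
        A_eq[M + m, m::M] = 1.0                   # sum_{m}  Gamma(m, m') = w_hat_{m'}
    b_eq = np.concatenate([w, w_hat])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(M, M), res.fun

# Toy example; in Proposition 3.4 the cost would be max_a ||mu_m(a,.) - mu_hat_{m'}(a,.)||_1.
w, w_hat = np.array([0.6, 0.4]), np.array([0.5, 0.5])
cost = np.array([[0.05, 0.9], [0.9, 0.1]])
Gamma, value = best_coupling(w, w_hat, cost)
print(Gamma, value)
```

This computation is only an evaluation tool for the distance used in the analysis; the learning algorithm itself never needs to solve it.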
We then show that, for any history-dependent policy $\pi$, the expected cumulative rewards of $\pi$ are approximately the same for any two close LMAB instances.

**Proposition 3.4.** Let $B = (\mathcal{A}, \{w_m\}_{m=1}^M, \{\mu_m\}_{m=1}^M)$ and $\hat B = (\mathcal{A}, \{\hat w_m\}_{m=1}^M, \{\hat\mu_m\}_{m=1}^M)$ be any two LMABs. Then, for any history-dependent policy $\pi : (\mathcal{A} \times \mathcal{Z})^* \to \mathcal{A}$, we have
$$|V(\pi) - \hat V(\pi)| \le H^2 \inf_{\Gamma} \sum_{(m,m') \in [M]^2} \Gamma(m, m') \max_{a \in \mathcal{A}} \|\mu_m(a, \cdot) - \hat\mu_{m'}(a, \cdot)\|_1,$$
where the infimum over $\Gamma$ is taken over joint distributions over $(m, m')$ satisfying (3).

From Corollary 3.2 we have $\max_{a \in \mathcal{A}} \|\mu_m(a, \cdot) - \hat\mu_{m'}(a, \cdot)\|_1 \le 2Z\sqrt{2M}\, \|\nu_m - \hat\nu_{m'}\|_\infty$ for any $m, m' \in [M]$. Plugging this into Proposition 3.4, we have $|V(\pi) - \hat V(\pi)| \le \mathrm{poly}(H, Z, M, n)\cdot \delta^{1/(2M-1)}$. Thus, using Lemma 3.3 with $\delta < (\epsilon/\mathrm{poly}(H, Z, M, n))^{2M-1}$, we can conclude that any $\epsilon$-optimal policy for $\hat B$ is $O(\epsilon)$-optimal for the underlying LMAB $B$.

### 3.3 $H < 2M-1$: Unidentifiable Regime with Short Time-Horizon

A more interesting regime is the one in which the time-horizon $H$ is smaller than the required degree of moments $2M-1$. In such a setting we cannot measure moments of degree higher than $H$. Therefore, if $H < 2M-1$, we cannot rely on the identifiability of the underlying LMAB model. Instead, we make the following claim: to compute an optimal policy only for time-horizon $H < 2M-1$, we only need to match the $H$-th-order moments. This is formalized in the next lemma.

**Lemma 3.5.** Suppose that $\big\|\sum_{m=1}^M w_m \mu_m^{\otimes H} - \sum_{m=1}^M \hat w_m \hat\mu_m^{\otimes H}\big\|_\infty \le \delta$. Then for any history-dependent policy $\pi \in \Pi$, we have $|V(\pi) - \hat V(\pi)| \le H Z^H \delta$.

Therefore, it is sufficient to find estimates of the mixing weights and latent models that match the measured $H$-th-order moment, and then compute an optimal policy for the estimated model. Lemma 3.5 is natural for discrete reward distributions with bounded support. Interestingly, we extend this result to Gaussian rewards, which are continuous and unbounded, in Appendix D. Hence, a natural idea is to estimate parameters $\{\hat w_m\}_{m=1}^M, \{\hat\mu_m\}_{m=1}^M$ that satisfy the condition in Lemma 3.5 without incurring $O(A^H)$ sample complexity. This is possible if we have good estimates of moment-matching parameters for a set of core action-values.

**Proposition 3.6.** For any given $l \ge 1$, if $\big\|\sum_{m=1}^M w_m \nu_m^{\otimes l} - \sum_{m=1}^M \hat w_m \hat\nu_m^{\otimes l}\big\|_\infty \le \delta$, then $\big\|\sum_{m=1}^M w_m \mu_m^{\otimes l} - \sum_{m=1}^M \hat w_m \hat\mu_m^{\otimes l}\big\|_\infty \le (2M)^{l/2}\, \delta$.

By Proposition 3.6, we can conclude that it is sufficient to estimate $T_H := \sum_{m=1}^M w_m \nu_m^{\otimes H}$ and find $\{(\hat w_m, \hat\nu_m)\}_{m=1}^M$ that match $T_H$ element-wise up to accuracy $\delta := (\epsilon/H)/(Z\sqrt{2M})^H$.

### 3.4 Main Result

The sections above have outlined the key ideas we need, assuming knowledge of $U$. In this section, we show how we can estimate $U$, and we describe the complete procedure that learns a near-optimal policy of an LMAB instance. We give the details in Algorithm 1. Our algorithm is divided into three steps, as detailed below.

**Step 1: Estimating $U$.** Algorithm 1 first estimates the second-order moment $M_2 = \sum_{m=1}^M w_m \mu_m \mu_m^\top$. Let $\hat U$ be the span of the top-$M$ eigenvectors of an empirical estimate $\hat M_2$, where $\hat M_2$ is constructed by collecting samples of reward correlations from random actions at the first two time steps. Note that $\hat U$ may not be a good proxy for some $\mu_m$ with small mixing probability $w_m \approx 0$, and thus not all elements of $U$ are necessarily close to $\hat U$. We defer the details of the subspace-recovery procedure to the proof of the following lemma in Appendix C.7.

**Lemma 3.7.** Let $\hat U$ be the subspace spanned by the top-$M$ eigenvectors of $\hat M_2$. After we estimate $\hat M_2$ using $N_0 = O(A^4 Z^2 \log(ZA/\eta)/\delta_{\mathrm{sub}}^4)$ episodes, with probability at least $1 - \eta$, for all $m \in [M]$ there exists $\Delta_m$ with $\|\Delta_m\| \le \delta_{\mathrm{sub}}/w_m^{1/2}$ such that $\mu_m + \Delta_m \in \hat U$.

Our choice of $\delta_{\mathrm{sub}}$ differs in two regimes, as follows:
$$\delta_{\mathrm{sub}} = \begin{cases} \dfrac{\epsilon}{2ZMH^2} & \text{if } H \ge 2M-1,\\[6pt] \min\Big\{\sqrt{w_{\min} + \epsilon/\big(MH^2(Z\sqrt{2M})^H\big)},\ \ \epsilon/(H\sqrt{M})^H\Big\} & \text{if } H < 2M-1, \end{cases} \tag{5}$$
where $w_{\min} = \min_{m \in [M]} w_m$. The main difference between the two regimes is that for $H \ge 2M-1$, when the parameters are identifiable, latent contexts with small mixing probabilities can be ignored once $w_m = o(\epsilon/M)$, while still guaranteeing closeness in the distributions of observations. However, in the parameter-unidentifiable regime, the total variation distance is bounded only through errors in the moment space. Therefore, the estimated subspace needs to be accurate even for contexts with small mixing probabilities, to keep the higher-order moments well approximated. Note that for instances with well-balanced mixing probabilities, i.e., if $w_{\min} = \Omega(1/M)$, the order of $\delta_{\mathrm{sub}}$ remains the same as in the $H \ge 2M-1$ case.

**Algorithm 1**
1: Input: accuracy level $\epsilon$, $\delta_{\mathrm{sub}}, \delta_{\mathrm{tsr}} > 0$, model parameters $M, A, Z, H$.
2: // Step 1: Estimate subspace $U$.
3: Construct $\hat M_2$, an estimate of $M_2 = \sum_{m=1}^M w_m \mu_m \mu_m^\top$, using $N_0 = O(A^4 Z^2/\delta_{\mathrm{sub}}^4)$ episodes.
4: Calculate $\hat U$, the span of the top-$M$ eigenvectors of $\hat M_2$.
5: // Step 2: Estimate $\{(w_m, \nu_m)\}_{m=1}^M$.
6: Get core action-value pairs $\{(a_j, z_j)\}_{j=1}^n$ by calling Corollary 3.2.
7: Construct $\{\hat T_l\}_{l=1}^{\min(H, 2M-1)}$ using $N = O(n^{\min(2M-1, H)} M^2/\delta_{\mathrm{tsr}}^2)$ episodes.
8: Find valid empirical parameters $\{(\hat w_m, \hat\nu_m)\}_{m=1}^M$ satisfying (7).
9: // Step 3: Use the core action-value pairs to construct an empirical LMAB.
10: Construct the empirical model $\hat B = (\mathcal{A}, \{\hat w_m\}_{m=1}^M, \{\hat\mu_m\}_{m=1}^M)$.
11: Output: an optimal policy of $\hat B$ by calling a planning oracle (Definition 2.3) for $\hat B$.

**Step 2: Moment Matching.** Given $\hat U$, the subspace spanned by the top-$M$ eigenvectors of $\hat M_2$, Algorithm 1 follows the procedure described in Sections 3.1-3.3. It constructs the set of (approximate) core action-value pairs $\{(a_j, z_j)\}_{j=1}^n$ (see Appendix C.8.1 for the detailed algorithm). Then, we construct higher-order moments of the core action-value pairs. For every multi-index $(i_1, i_2, \ldots, i_l) \in [n]^l$, using $N_1 = O(\log(l n^l/\eta)/\delta_{\mathrm{tsr}}^2)$ episodes, we execute $a_t^k = a_{i_t}$ for $t = 1, \ldots, l$ and estimate the higher-order moments (as also described in equation (1)):
$$\hat T_l(i_1, i_2, \ldots, i_l) = \frac{1}{N_1} \sum_{k=1}^{N_1} \prod_{t=1}^l \mathbf{1}\{r_t^k = z_{i_t}\}. \tag{6}$$
Using standard concentration inequalities and a union bound over all elements in the tensors, we get $\|\hat T_l - T_l\|_\infty < \delta_{\mathrm{tsr}}$ with probability at least $1 - \eta$. Then we find empirical parameters $\{(\hat w_m, \hat\nu_m)\}_{m=1}^M$ that satisfy
$$\Big\|\sum_{m=1}^M \hat w_m \hat\nu_m^{\otimes l} - \hat T_l\Big\|_\infty < \delta_{\mathrm{tsr}}, \qquad \forall\, l \in [\min(H, 2M-1)]. \tag{7}$$
We set $\delta_{\mathrm{tsr}} = O\big(\epsilon/(Z H^2 M^{3.5} \sqrt{n})\big)^{2M-1}$ when $H \ge 2M-1$. In the parameter-unidentifiable regime $H < 2M-1$, we set $\delta_{\mathrm{tsr}} = O(\epsilon/H)/(Z\sqrt{2M})^H$.
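Finding parameters satisfying (7) is a non-convex feasibility problem, and the guarantee only requires that some valid parameters matching the empirical tensors be returned. Purely as an illustration of the objective (and not the procedure analyzed in the paper), the sketch below searches for such parameters by minimizing squared moment residuals over a softmax/sigmoid parameterization; `fit_moments` and `mixture_moment` are our own names.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, softmax

def mixture_moment(w, nu, l):
    """sum_m w_m * nu_m^{otimes l}, returned as an array of shape (n,) * l."""
    n = nu.shape[1]
    T = np.zeros((n,) * l)
    for w_m, nu_m in zip(w, nu):
        outer = nu_m
        for _ in range(l - 1):
            outer = np.multiply.outer(outer, nu_m)
        T += w_m * outer
    return T

def fit_moments(T_hats, M, n, seed=0):
    """Heuristic search for (w_hat, nu_hat) approximately matching the empirical
    tensors {hat T_l} element-wise, in the spirit of (7). Not the paper's method."""
    ls = sorted(T_hats.keys())

    def unpack(theta):
        w = softmax(theta[:M])                      # mixing weights on the simplex
        nu = expit(theta[M:].reshape(M, n))         # core reward probabilities in (0, 1)
        return w, nu

    def residual(theta):
        w, nu = unpack(theta)
        return sum(np.sum((mixture_moment(w, nu, l) - T_hats[l]) ** 2) for l in ls)

    theta0 = np.random.default_rng(seed).standard_normal(M + M * n)
    res = minimize(residual, theta0, method="L-BFGS-B")   # gradients via finite differences
    return unpack(res.x)

# Illustrative empirical tensors for a two-context toy instance with n = 2 core pairs.
T_hats = {1: np.array([0.5, 0.5]), 2: np.array([[0.41, 0.09], [0.09, 0.41]])}
w_hat, nu_hat = fit_moments(T_hats, M=2, n=2)
print(w_hat)
print(nu_hat)
```

Being a local search from a random initialization, this fit can land in spurious local minima; the maximum-likelihood formulation of Section 4 is the practical alternative we actually advocate.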
**Step 3: Constructing an Empirical LMAB.** Finally, Algorithm 1 uses the estimates $\{(\hat w_m, \hat\nu_m)\}_{m=1}^M$ to construct an empirical model $\hat B = (\mathcal{A}, \{\hat w_m\}_{m=1}^M, \{\hat\mu_m\}_{m=1}^M)$ after proper clipping and normalization. For this step to succeed, we require $\hat w_m$ and $\hat\nu_m$ to be valid parameters for the reconstruction of a valid empirical model. We state the details of the recovery procedure in Appendix C.8.1. Once a valid empirical model $\hat B$ is obtained, the remaining step is to call the planning oracle, which gives an $\epsilon$-approximate optimal policy for $\hat B$. We conclude this section with an end-to-end guarantee.

**Theorem 3.8.** For any LMAB instance with $M$ latent contexts, with probability at least $1 - \eta$, Algorithm 1 returns an $\epsilon$-optimal policy given a total number of episodes of at most
$$\begin{cases} \mathrm{poly}(H, M, Z, A, 1/\epsilon)\,\log(AZ/\eta) + \mathrm{poly}(H, Z, M)^{2M-1}\,\log(M/\eta)/\epsilon^{4M-2} & \text{if } H \ge 2M-1,\\[4pt] \mathrm{poly}(H, w_{\min}^{-1}, Z, A, 1/\epsilon)\,\log(AZ/\eta) + H^2\,\log(M/\eta)\,(2MZ^2)^H/\epsilon^2 & \text{otherwise}. \end{cases}$$

Note that the dependency on $A$ is polynomial. This polynomial term is needed to control the subspace-estimation error. The exponential term is derived from the closeness in moments, as discussed earlier.

**Remark 3.9 (Note on $w_{\min}$).** Ideally, the mixing weights are well balanced, such that $w_{\min} = \Omega(1/M)$. However, in general we may have arbitrarily small $w_{\min}$. Note that when $H \ge 2M-1$, the upper bound does not depend on $w_{\min}$. However, when $H < 2M-1$ with arbitrarily small $w_{\min}$, effectively we require $w_{\min} \gtrsim \epsilon/(Z\sqrt{M})^H$, as can be seen in equation (5). This is mainly because in the parameter-unidentifiable regime we give guarantees through moment-closeness, and thus the subspace has to be very accurate to approximate the higher-order moments well. Nevertheless, we do not know whether this is tight, and believe this to be an interesting question for future research.

**Continuous Rewards: the Gaussian Case.** So far we have focused on rewards with finite support $Z = O(1)$. However, some steps in the algorithm cannot be straightforwardly extended to continuous reward distributions. In Appendix D, we show a similar upper bound of at most
$$\mathrm{poly}(H, M, A, 1/\epsilon)\,\log(A/\eta) + \mathrm{poly}\big(H, M, \log(1/(\eta\epsilon))\big)^{\min(H,M)}/\epsilon^{\min(2H+2,\,4M-2)},$$
assuming the rewards are Gaussian with unknown means (see Theorem D.2 for the exact upper bound).
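Before turning to the practical implementation, here is a minimal sketch of the reconstruction in Step 3: estimates at the core coordinates are lifted back to full reward models through the linear transformation of Corollary 3.2 (item 2), then clipped and normalized into a valid empirical LMAB. It assumes an estimated basis `U_hat` and core indices of the kind produced by the earlier sketches, glosses over the exact clipping rules of Appendix C.8.1, and uses our own naming.

```python
import numpy as np

def reconstruct_model(U_hat, core_idx, w_hat, nu_hat, A, Z):
    """U_hat: (A*Z, M) basis of the estimated subspace; core_idx: indices of the n core
    coordinates; w_hat: (M,) weights; nu_hat: (M, n) estimates at the core coordinates.
    Returns clipped/normalized (w, mu) with mu of shape (M, A, Z)."""
    w = np.clip(np.asarray(w_hat, dtype=float), 0.0, None)
    w = w / w.sum()                                          # project weights back onto the simplex
    mus = []
    for nu_m in nu_hat:
        # Least-squares coefficients of the subspace element matching the core coordinates
        # (the linear transformation of Corollary 3.2, item 2).
        c, *_ = np.linalg.lstsq(U_hat[core_idx], nu_m, rcond=None)
        mu = (U_hat @ c).reshape(A, Z)
        mu = np.clip(mu, 0.0, 1.0)
        mu = mu / np.maximum(mu.sum(axis=1, keepdims=True), 1e-12)  # each row a distribution over Z
        mus.append(mu)
    return w, np.stack(mus)
```

An approximate planning oracle (Definition 2.3) is then called on the resulting empirical model $\hat B$.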
## 4 Maximum Likelihood Implementation

In the previous section, we derived a procedure that learns a near-optimal policy of an LMAB instance. However, our procedure relies on matching higher-order moments, which, in general, is not suitable for practical implementation. In this section, we describe a maximum-likelihood (MLE) method motivated by our previous results. Importantly, we can use the Expectation-Maximization (EM) heuristic [11] to find an approximate solution to the MLE optimization problem.

We start from the set of core action-value pairs $\{(a_j, z_j)\}_{j=1}^n$ given by Corollary 3.2. For every time step $t$, we choose $i_t$ uniformly at random from $[n]$, play action $a_{i_t}$, and observe $b_t := \mathbf{1}\{r_t = z_{i_t}\}$. After repeating this for $N$ episodes, we formulate the log-likelihood function with parameterization $\theta = \{(w_m, \nu_m)\}_{m=1}^M$ as follows:
$$l_N(\theta) := \frac{1}{N} \sum_{k=1}^N \log\Bigg(\sum_{m=1}^M w_m \prod_{t=1}^H \big(b_t^k\, \nu_m(i_t^k) + (1 - b_t^k)(1 - \nu_m(i_t^k))\big)\Bigg). \tag{8}$$
To prevent confusion with the search parameters, we use $q^*$ to denote any quantity $q$ constructed with the ground-truth parameters, e.g., $\theta^* := \{(w_m^*, \nu_m^*)\}_{m=1}^M$. Let $\theta_N$ be the maximum likelihood estimator, i.e., $\theta_N = \arg\max_{\theta \in \Theta} l_N(\theta)$ over some valid parameter set $\Theta$. We can recover the sample complexity guaranteed by the moment-matching methods studied in the previous section. Specifically, we show that the maximum likelihood estimator with sufficiently many samples has nearly matching moments:

**Lemma 4.1.** Consider the maximum likelihood estimator $\theta_N = \{(\hat w_m, \hat\nu_m)\}_{m=1}^M$ with $N$ episodes, for large enough $N$. If $N \ge C\, n^{\min(2H+1,\,4M-1)} \log(N/\eta)/\delta^2$ for some sufficiently large constant $C > 0$, then with probability at least $1 - \eta$,
$$\|T_l^* - \hat T_l\|_\infty \le \delta, \qquad \forall\, l \in [\min(H, 2M-1)], \tag{9}$$
where $\hat T_l := \sum_{m=1}^M \hat w_m \hat\nu_m^{\otimes l}$.

Therefore, by setting the same accuracy parameter $\delta$ used in Algorithm 1, we can obtain the required $\theta_N = \{(\hat w_m, \hat\nu_m)\}_{m=1}^M$ for matching moments. We only replace the moment-matching step (Step 2 in Algorithm 1) with solving the MLE optimization problem (8).

**Implementation Details.** While the log-likelihood formulation (8) is still non-convex and thus intractable in the worst case, we can rely on powerful heuristics: we initialize the parameters with clustering methods [2] or spectral methods [1, 7], and run the EM algorithm to improve the accuracy [11] (regardless of conditions or guarantees). Therefore, while the sample-complexity upper bound for both methods remains the same, we benefit from the log-likelihood formulation despite the non-convexity. Specifically, we follow the same steps as in Algorithm 1. We first find the core action-value pairs from second-order moments as described (Step 1). Then we form the maximum log-likelihood objective (8) and find an approximate MLE solution with the EM algorithm (Step 2). After obtaining $\theta_N$, we recover an empirical model which we use as input to an approximate planning oracle (Step 3).

*Figure 1: Per-time-step rewards for increasing lengths of episodes, for the history-dependent policies returned after the exploration phase.*

We compare our MLE-based implementation with experimental design (ED + MLE) to other baselines applicable to learning in LMABs (e.g., L-UCRL + EM [24], tensor-decomposition methods [1, 7, 41], naive-UCB [6]). In our experimental results, we observe that in practice our method (ED + MLE) outperforms the other baselines and the worst-case guarantee given in Theorem 3.8.

**Provable Benefits of MLE.** In Appendix E.2, we also show that MLE solutions can automatically adapt to mild separation conditions. Specifically, we show that if $H = O(M^2/\gamma^2)$ for some separation parameter $\gamma > 0$ (see Assumption E.1), then we can achieve polynomial sample complexity for learning a near-optimal policy with the MLE solution $\theta_N$. In Figure 1, we compare the average per-time-step rewards obtained with the policies returned by different algorithms after some exploration episodes. Note that with increasing $H$ it becomes easier to identify contexts, i.e., we have bigger separations, and thus MLE-based approaches can benefit from larger separations. Due to space constraints, we defer the remaining details to Appendix A.
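As a concrete illustration of the EM heuristic for (8), the sketch below alternates posterior responsibilities over latent contexts (E-step) with reweighted estimates of $(w, \nu)$ (M-step), on data consisting of the uniformly chosen core-pair indices $i_t^k$ and binary outcomes $b_t^k$. It uses a random initialization rather than the clustering/spectral initialization described above, and all names are ours.

```python
import numpy as np

def em_lmab(I, B, M, n_iter=200, seed=0):
    """EM for the mixture likelihood (8).
    I: (N, H) chosen core-pair indices in [n]; B: (N, H) binary outcomes b_t = 1{r_t = z_{i_t}}.
    Returns (w_hat, nu_hat) with nu_hat of shape (M, n). A heuristic sketch, not the paper's code."""
    N, H = I.shape
    n = int(I.max()) + 1
    rng = np.random.default_rng(seed)
    w = np.full(M, 1.0 / M)
    nu = rng.uniform(0.3, 0.7, size=(M, n))                  # random initialization
    for _ in range(n_iter):
        # E-step: posterior responsibility of each context for each episode (in log space).
        log_lik = np.zeros((N, M))
        for m in range(M):
            p = nu[m][I]                                     # (N, H): nu_m(i_t^k)
            log_lik[:, m] = np.log(w[m] + 1e-12) + np.sum(
                B * np.log(p + 1e-12) + (1 - B) * np.log(1 - p + 1e-12), axis=1)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        gamma = np.exp(log_lik)
        gamma /= gamma.sum(axis=1, keepdims=True)            # (N, M) responsibilities
        # M-step: reweighted mixing weights and per-coordinate success rates.
        w = gamma.mean(axis=0)
        for m in range(M):
            num, den = np.zeros(n), np.zeros(n)
            np.add.at(num, I.reshape(-1), (gamma[:, m][:, None] * B).reshape(-1))
            np.add.at(den, I.reshape(-1), np.repeat(gamma[:, m], H))
            nu[m] = np.clip(num / np.maximum(den, 1e-12), 1e-6, 1 - 1e-6)
    return w, nu

# Synthetic check: two contexts with opposite success probabilities on two core pairs.
rng = np.random.default_rng(1)
N, H, n = 5000, 5, 2
true_nu = np.array([[0.9, 0.1], [0.1, 0.9]])
ms = rng.choice(2, size=N, p=[0.5, 0.5])                     # latent context per episode
I = rng.integers(0, n, size=(N, H))                          # uniformly chosen core pairs
B = (rng.random((N, H)) < true_nu[ms[:, None], I]).astype(float)
print(em_lmab(I, B, M=2))
```

Because the contexts here are well separated, EM recovers the parameters up to label permutation; in poorly separated instances the quality of the initialization matters considerably more.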
## 5 Conclusion and Future Work

In this work, we designed an algorithm that learns a near-optimal policy for an LMAB instance using only $\mathrm{poly}(A) + \mathrm{poly}(M, H)^{\min(H,\,2M-1)}$ samples. We achieved this with experimental design and the method of moments, together with maximum likelihood estimation, which is more suitable in practice. We discuss a few limitations of this work and future directions below.

**Lower Bounds.** The question of a lower bound on the sample complexity of the LMAB problem remains unresolved. For LMDPs, the MDP extension of LMABs, a lower bound of $\Omega(A^M)$ is known due to [24]. While we conjecture that some exponential dependence on $M$ is unavoidable, characterizing the minimax dependence of the exponent is left as an interesting future research question.

**Latent MDPs.** We believe that the moment-matching-based approach we took in this work offers a promising way to design RL algorithms in the presence of latent contexts. Specifically, we believe these techniques can be used to design algorithms that find a near-optimal policy of LMDPs [24] or reward-mixing MDPs [23] with $M = O(1)$ and without further separation assumptions.

**Linear Bandits / Continuous Rewards.** Our work has focused on the tabular setting, where all arms are independent. It will be interesting future work to consider the same objective in linear bandit settings. Also, while we only considered Gaussian rewards, we believe our approach can also be extended to a broader class of parametric distributions. Investigating more general classes of rewards is an important future research direction from a practical perspective.

**Acknowledgements.** This research was partially funded by the NSF IFML Institute (NSF 2019844), the NSF AI-EDGE Institute (NSF 2112471), and NSF 1704778, and was also supported by the Israel Science Foundation (grant No. 2199/20).

## References

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[2] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[3] J.-Y. Audibert, S. Bubeck, et al. Minimax policies for adversarial and stochastic bandits. In COLT, volume 7, pages 1–122, 2009.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[6] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.
[7] K. Azizzadenesheli, A. Lazaric, and A. Anandkumar. Reinforcement learning of POMDPs using spectral methods. In Conference on Learning Theory, pages 193–256, 2016.
[8] A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 594–603. ACM, 2014.
[9] E. Brunskill and L. Li. Sample complexity of multi-task reinforcement learning. In Uncertainty in Artificial Intelligence, page 122. Citeseer, 2013.
[10] R. Chawla, A. Sankararaman, A. Ganesh, and S. Shakkottai. The gossiping insert-eliminate algorithm for multi-agent bandits. In International Conference on Artificial Intelligence and Statistics, pages 3471–3481. PMLR, 2020.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[12] N. Doss, Y. Wu, P. Yang, and H. H. Zhou. Optimal estimation of high-dimensional Gaussian mixtures. arXiv preprint arXiv:2002.05818, 2020.
[13] A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188. Springer, 2011.
[14] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757–765, 2014.
[15] A. Gopalan, O.-A. Maillard, and M. Zaki. Low-rank bandits with latent mixtures. arXiv preprint arXiv:1609.01508, 2016.
[16] A. Hallak, D. Di Castro, and S. Mannor. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.
[17] J. Hu, X. Chen, C. Jin, L. Li, and L. Wang. Near-optimal representation learning for linear bandits and linear RL. In International Conference on Machine Learning, pages 4349–4358. PMLR, 2021.
[18] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[19] C. Jin, S. Kakade, A. Krishnamurthy, and Q. Liu. Sample-efficient reinforcement learning of undercomplete POMDPs. Advances in Neural Information Processing Systems, 33:18530–18539, 2020.
[20] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
[21] B. Kveton, M. Konobeev, M. Zaheer, C.-w. Hsu, M. Mladenov, C. Boutilier, and C. Szepesvari. Meta-Thompson sampling. In International Conference on Machine Learning, pages 5884–5893. PMLR, 2021.
[22] J. Kwon and C. Caramanis. The EM algorithm gives sample-optimality for learning mixtures of well-separated Gaussians. In Conference on Learning Theory, pages 2425–2487, 2020.
[23] J. Kwon, Y. Efroni, C. Caramanis, and S. Mannor. Reinforcement learning in reward-mixing MDPs. Advances in Neural Information Processing Systems, 34, 2021.
[24] J. Kwon, Y. Efroni, C. Caramanis, and S. Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
[25] J. Kwon, Y. Efroni, C. Caramanis, and S. Mannor. Coordinated attacks against contextual bandits: Fundamental limits and defense mechanisms. In Proceedings of the 39th International Conference on Machine Learning, pages 11772–11789. PMLR, 2022.
[26] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[27] T. Lattimore, C. Szepesvari, and G. Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
[28] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Machine Learning Proceedings 1995, pages 362–370. Elsevier, 1995.
[29] C.-Y. Liu and L. Li. On the prior sensitivity of Thompson sampling. In International Conference on Algorithmic Learning Theory, pages 321–336. Springer, 2016.
[30] Q. Liu, A. Chung, C. Szepesvári, and C. Jin. When is partially observable reinforcement learning not scary? arXiv preprint arXiv:2204.08967, 2022.
[31] O.-A. Maillard and S. Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144, 2014.
[32] A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.
[33] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.
[34] W. Schudy and M. Sviridenko. Concentration and moment inequalities for polynomials of independent random variables. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 437–446. SIAM, 2012.
[35] M. Simchowitz, C. Tosh, A. Krishnamurthy, D. J. Hsu, T. Lykouris, M. Dudik, and R. E. Schapire. Bayesian decision-making under misspecified priors with applications to meta-learning. Advances in Neural Information Processing Systems, 34, 2021.
[36] A. Slivkins and E. Upfal. Adapting to a changing environment: the Brownian restless bandits. In COLT, pages 343–354, 2008.
[37] L. N. Steimle, D. L. Kaufman, and B. T. Denton. Multi-model Markov decision processes. Optimization Online, URL http://www.optimization-online.org/DB_FILE/2018/01/6434.pdf, 2018.
[38] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
[39] M. J. Todd. Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, 2016.
[40] Y. Wu, P. Yang, et al. Optimal estimation of Gaussian mixtures via denoised method of moments. Annals of Statistics, 48(4):1981–2007, 2020.
[41] X. Zhou, Y. Xiong, N. Chen, and X. Gao. Regime switching bandits. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

## 6 Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] This work is primarily focused on the theoretical aspect and its purpose is too general for us to speculate about broader societal impacts.
   (d) Did you describe the limitations of your work? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] In the Supplementary Materials.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the Supplementary Materials.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] We did average errors over multiple independent experiments, but error bars are not reported due to computational burden.
   (d) Did you include the amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]

The remaining checklist items are not applicable to us.