Adaptive Sampling for Estimating Probability Distributions

Shubhanshu Shekhar (ECE Department, University of California, San Diego), Tara Javidi (ECE Department, University of California, San Diego), Mohammad Ghavamzadeh (Google Research)

Abstract

We consider the problem of allocating a fixed budget of samples to a finite set of discrete distributions to learn them uniformly well (minimizing the maximum error) in terms of four common distance measures: $\ell_2^2$, $\ell_1$, $f$-divergence, and separation distance. To present a unified treatment of these distances, we first propose a general optimistic tracking algorithm and analyze its sample-allocation performance with respect to an oracle. We then instantiate this algorithm for the four distance measures and derive bounds on their regret. We also show that the allocation performance of the proposed algorithm cannot, in general, be improved, by deriving lower bounds on the expected deviation from the oracle allocation for any adaptive scheme. We verify our theoretical findings through some experiments. Finally, we show that the techniques developed in the paper can be easily extended to learning some classes of continuous distributions, as well as to the related setting of minimizing the average error (rather than the maximum error) in learning a set of distributions.

1. Introduction

Consider the problem in which a learner must allocate $n$ samples among $K$ discrete distributions to construct uniformly good (minimizing the maximum error) estimates of these distributions in terms of a distance measure $D$. Depending on $D$, certain distributions may require far fewer samples than others to be estimated with the same precision. The optimal sampling strategy for a given $n$ requires knowledge of the true distributions. The goal of this paper is to design adaptive allocation strategies that converge to the optimal strategy while remaining oblivious to the true distributions.

The problem described above models several applications that are not captured by existing works. Here, we describe some such applications. (1) Opinion polling: Suppose there are $K$ groups of voters who have different preference distributions over $l$ candidates in an election. While some groups might heavily favour a single candidate, others might be indifferent, resulting in a more uniform distribution over the set of candidates. In this setting, how should the polling agency allocate its sampling budget? Intuitively, more samples should be allocated to the indifferent voter groups in order to construct uniformly good estimates of their preference distributions. (2) Compression of text files: Given a sampling budget of $n$ bytes, consider the problem of designing codes with minimum average length for text files in $K$ different languages. Since different languages may have different symbol frequencies, this can be formulated as learning $K$ distributions uniformly well in terms of certain $f$-divergences. (3) Learning an MDP model: In many sequential decision-making problems, the agent's interaction with the environment is modeled as a Markov decision process (MDP). In these problems, it is often important to accurately estimate the dynamics (i.e., the transition structure of the MDP), given a finite exploration budget.
Learning the MDP model is equivalent to estimating $S \times A$ distributions, where $S$ and $A$ are the number of states and actions of the (finite) MDP. Therefore, assuming the existence of a known policy that can efficiently transition the MDP between any two states, the problem reduces to finding the optimal allocation of samples to these $S \times A$ distributions. Thus, the framework studied in this paper provides a first step towards solving the general problem of constructing accurate models for MDPs. The requirement of a known policy to transition between states can be relaxed by employing the techniques recently developed for efficient exploration in MDPs (e.g., Tarbouriech & Lazaric 2019; Hazan et al. 2019; Cheung 2019), which we leave for future work.

Antos et al. (2008) were the first to study the problem of learning the means of $K$ distributions uniformly well, and proposed and analyzed an algorithm based on a forced-exploration strategy. Carpentier et al. (2011) proposed and analyzed an alternative approach for the same problem, based on the UCB algorithm (Auer et al., 2002). Carpentier & Munos (2011) analyzed an optimistic policy for the related problem of stratified sampling, where the goal is to learn $K$ distributions in terms of a weighted average distance (instead of the max). Soare et al. (2013) extended the optimistic strategy to the case of uniformly estimating $K$ linearly correlated distributions. Riquelme et al. (2017) applied the optimistic strategy to the problem of allocating covariates (drawn in an i.i.d. manner from some distribution) for uniform estimation of $K$ linear models. The prior work mentioned above has focused solely on estimating the means of distributions in the squared-error sense, and its analytic techniques do not extend to learning entire distributions. In this paper, we generalize the above-mentioned prior work by considering the problem of active sampling to uniformly learn $K$ distributions in terms of pre-specified distance measures on the space of probability distributions.

Overview of Results. Intuitively, the optimal allocation should equalize the expected distance between the true distribution and the resulting empirical estimate for all the $K$ distributions. This allocation, however, may have a complex dependence on the true distributions $P_i$, for $1 \le i \le K$. Our approach in this paper is to first identify an objective function which (i) is a good approximation of the true objective for a given distance measure $D$, (ii) depends on the original distribution $P_i$ only through a single real-valued parameter $c_i$, and (iii) has a decoupled dependence on $c_i$ and $T_i$. In Sec. 3, we formally define an appropriate function class $\mathcal{F}$ within which the objective functions for various distance measures should lie. We then propose a generic optimistic tracking strategy (Alg. 1) which addresses the trade-off between constructing better estimates of the parameters $c_i$ and using the existing estimates of $c_i$ to drive the allocation towards the optimal one. We also obtain a general bound on its deviation from an (approx-)oracle allocation (defined in Sec. 3). In Sec. 4, we first present a road-map for designing adaptive sampling schemes for arbitrary loss functions using the results of Sec. 3, and then specialize this to the case of four widely used distance measures: $\ell_2^2$, $\ell_1$, $f$-divergence, and separation distance. For each distance measure, we obtain bounds on the regret of the proposed sampling scheme w.r.t. an oracle strategy.
In Sec. 5, we derive matching lower bounds on the expected deviation from the oracle allocation for any algorithm. Experiments with synthetic examples in Sec. 6 validate our theoretical results. Finally, in Sec. 7 we discuss how our techniques can be extended to learning some classes of continuous distributions, as well as to the related problem of minimizing the average error.

Technical Contributions. The results of this paper require generalizing existing techniques, as well as introducing new methods. More specifically, the proof of Theorem 1 abstracts out the arguments of Carpentier et al. (2011, Thm. 1) to deal with a much larger class of objective functions. Prior work on mean-squared error (Antos et al., 2008; Carpentier et al., 2011) required bounding the first and second moments of random sums, which could be achieved by a direct application of Wald's equations (Durrett, 2019, Thm. 4.8.6). Our results on $f$-divergence (Thm. 7 and Lemma 9 in Appendix F) require analyzing higher moments of random sums, for which Wald's equations are not applicable. Deriving the approximate objective function for the separation distance involves estimating the expectation of the maximum of some correlated random variables. We obtain upper and lower bounds on this expectation in Lemma 6 by first approximating the maximum with certain sums, and then bounding the sums using a normal approximation result (Ross, 2011, Thm. 3.2).

2. Problem Setup

Consider $K$ discrete distributions, $(P_i)_{i=1}^K$, that belong to the $(l-1)$-dimensional probability simplex $\Delta_l$ and take values in the set $\mathcal{X} = \{x_1, \dots, x_l\}$. Each distribution $P_i$ is equivalently represented by a vector $P_i = (p_{i1}, \dots, p_{il})$ with $p_{ij} \ge 0$, $j \in [l]$, and $\sum_{j=1}^{l} p_{ij} = 1$. For any integer $b > 0$, we denote by $[b]$ the set $\{1, \dots, b\}$. Given a budget of $n \ge K$ samples, we consider the problem of allocating samples to each of the $K$ distributions in such a way that the maximum (over the $K$ distributions) discrepancy between the empirical distributions (estimated from the samples) and the true distributions is minimized. To formally define this problem, suppose an allocation scheme assigns $(T_i)_{i=1}^K$ samples to the $K$ distributions, such that $T_i \ge 0$, $i \in [K]$, and $\sum_{i=1}^{K} T_i = n$. Also suppose that $\hat{P}_i$ is the empirical distribution with $\hat{p}_{ij} = T_{ij}/T_i$, where $T_{ij}$ denotes the number of times the outcome $x_j$ was observed in the $T_i$ draws from $P_i$, and $D : \Delta_l \times \Delta_l \to [0, \infty)$ is a distribution distance measure. Then, our problem of interest can be defined as finding an allocation scheme $(T_i)_{i=1}^K$ that solves the following constrained optimization problem:

$$\min_{T_1, \dots, T_K}\ \max_{i \in [K]}\ \mathbb{E}\big[D(\hat{P}_i, P_i)\big], \quad \text{s.t.} \quad \sum_{i=1}^{K} T_i = n. \tag{1}$$

We refer to the (non-integer) solution of (1) with full knowledge of $(P_i)_{i=1}^K$ as the oracle allocation $(T_i^*)_{i=1}^K$. It is important to note that $(T_i^*)_{i=1}^K$ ensure that the objective functions $\gamma_i(T_i) := \mathbb{E}[D(\hat{P}_i, P_i)]$ are equal for all $i \in [K]$. However, in practice, $(P_i)_{i=1}^K$ are not known. In this case, we refer to (1) as a tracking problem, in which the goal is to design adaptive sampling strategies that approximate the oracle allocation using running estimates of $(P_i)_{i=1}^K$.
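To make the objective in (1) concrete, the following is a minimal Monte-Carlo sketch (not part of the paper) that estimates $\gamma_i(T_i) = \mathbb{E}[D(\hat{P}_i, P_i)]$ and the resulting max-risk for a candidate allocation, using the squared $\ell_2$ distance as a stand-in for $D$; all function names are illustrative.

```python
import numpy as np

def empirical_distribution(p, T, rng):
    """Draw T samples from the discrete distribution p and return the empirical estimate."""
    counts = rng.multinomial(T, p)
    return counts / T

def expected_distance(p, T, dist, num_mc=2000, rng=None):
    """Monte-Carlo estimate of gamma_i(T) = E[ D(P_hat, P) ] for one distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    return np.mean([dist(empirical_distribution(p, T, rng), p) for _ in range(num_mc)])

def max_risk(P, allocation, dist):
    """Objective of problem (1): worst-case expected distance over the K distributions."""
    return max(expected_distance(p, T, dist) for p, T in zip(P, allocation))

if __name__ == "__main__":
    l2_sq = lambda p_hat, p: np.sum((p_hat - p) ** 2)          # squared l2 distance
    P = [np.array([0.8, 0.1, 0.1]), np.array([1/3, 1/3, 1/3])]  # K = 2, l = 3
    n = 600
    print("uniform allocation:", max_risk(P, [n // 2, n // 2], l2_sq))
    print("skewed allocation :", max_risk(P, [n // 3, 2 * n // 3], l2_sq))
```

Running the sketch illustrates the intuition from the introduction: the near-uniform distribution needs more samples than the peaked one to reach the same expected error.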
Choice of the Distance Measure. It is expected that the optimal allocation will strongly depend on the distance measure $D$. In this paper, we study four distances: $\ell_2^2$, $\ell_1$ or total variation (TV), $f$-divergence, and separation distance. These distances include all those in (Gibbs & Su, 2002) that do not require a metric structure on $\mathcal{X}$. The $f$-divergence family generalizes the well-known KL-divergence ($D_{KL}$) and includes a number of other common distances, such as total variation ($D_{TV}$), Hellinger ($D_H$), and chi-square ($D_{\chi^2}$). Applications of $f$-divergences include source and channel coding problems (Csiszár, 1967; 1995), testing goodness-of-fit (Györfi et al., 2000), and distribution estimation (Barron et al., 1992). The common $f$-divergences mentioned above satisfy the chain of inequalities $D_{TV} \le D_H \le \sqrt{D_{KL}} \le \sqrt{D_{\chi^2}}$, which defines a hierarchy of convergence among these measures (Tsybakov, 2009, Eq. 2.27). The separation distance $D_s(P, Q)$ (defined formally in Sec. 4.5) arises naturally in the study of the convergence of symmetric Markov chains to their stationary distribution. More specifically, if $Q$ is the stationary distribution of a Markov chain and $(P_t)_{t \ge 1}$ is its state distribution at time $t$, such that $Q = P_T$ at a random time $T$, then $D_s(P_t, Q) \le \mathbb{P}(T > t)$ (Aldous & Diaconis, 1987, Sec. 3).

Choice of Estimator. In this work, we fix the estimated distribution $\hat{P}_i$ to be the empirical distribution, i.e., $\hat{P}_i = [\hat{p}_{ij}]_{j=1}^l$, where $\hat{p}_{ij} = T_{ij}/T_i$. While the empirical estimator is known to be suboptimal in a min-max sense (Kamath et al., 2015), the additional error due to the deviation of $\mathbb{E}[D(\hat{P}_i, P_i)]$ from the min-max value, for some of the above distances ($\ell_2^2$, $\ell_1$, and $f$-divergence), does not change the final regret obtained. For instance, for the $\ell_2^2$ distance, the results of Kamath et al. (2015) show that $\mathbb{E}[D_{\ell_2^2}(\hat{P}_i, P_i)]$ differs from the min-max value by an $O(n^{-3/2})$ term. Since this term is of the same order as the regret we derive in Theorem 2, we conclude that for this loss the regret cannot be improved by using the min-max optimal estimator. Similar results can be shown for the $\ell_1$ distance and the $f$-divergence family as well.

Allocation Scheme and Regret. An adaptive allocation scheme $\mathcal{A}$ consists of a sequence of mappings $(\pi_t)_{t \ge 1}$, where each mapping $\pi_t : \mathbb{N} \times (\mathcal{X} \times [K])^{t-1} \to [K]$ selects an arm to pull at time $t$, based on the budget $n$ and the history of pulls and observations up to time $t$. (Each distribution can be considered as an arm, and thus we use the terms "sampling from a distribution" and "pulling an arm" interchangeably throughout the paper.) For an allocation scheme $\mathcal{A}$, a sampling budget $n$, and a distance measure $D$, we define the risk incurred by $\mathcal{A}$ as

$$L_n(\mathcal{A}, D) = \max_{i \in [K]} \mathbb{E}\big[D(\hat{P}_i, P_i)\big]. \tag{2}$$

We denote by $\mathcal{A}^*$ the oracle allocation rule. The performance of an allocation scheme $\mathcal{A}$ is measured by its suboptimality, or regret, w.r.t. $\mathcal{A}^*$, i.e.,

$$R_n(\mathcal{A}, D) := L_n(\mathcal{A}, D) - L_n(\mathcal{A}^*, D). \tag{3}$$

Notations. (See Table A.1 in App. A for a list of all the notations used.) For $0 < \eta < 1/2$, we define the $\eta$-interior of the $(l-1)$-dimensional simplex $\Delta_l$ as $\Delta_l^{(\eta)} := \{P \in \Delta_l \mid \eta \le p_j \le 1-\eta,\ \forall j \in [l]\}$. We use the Bernoulli random variable $Z_{ij}^{(s)}$ to represent the indicator that the $s$th draw from arm $i$ is equal to $x_j \in \mathcal{X}$. Note that for any draw $s$, we have $\mathbb{E}[Z_{ij}^{(s)}] = p_{ij}$. For any $t \in [n]$, we define $W_{ij,t} = \sum_{s=1}^t \bar{Z}_{ij}^{(s)}$, where $\bar{Z}_{ij}^{(s)} := Z_{ij}^{(s)} - p_{ij}$ is a centered Bernoulli variable. We also note that several terms, such as $\varphi$, $A$, $B$, and $e_n$ (to be introduced in Sec. 3), are overloaded for different distance measures. For instance, we use $\varphi$ for both $\ell_1$ and KL-divergence, instead of writing $\varphi^{(\ell_1)}$ and $\varphi^{(KL)}$. The meaning should be clear from the local context.
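For reference, the four distance measures discussed above can be written down directly. The following sketch (with illustrative function names, and assuming strictly positive probability vectors) mirrors the definitions used in this paper; the formal treatment is in Sec. 4.

```python
import numpy as np

def l2_squared(p, q):
    """Squared l2 distance: sum_j (p_j - q_j)^2."""
    return np.sum((p - q) ** 2)

def l1(p, q):
    """l1 distance: sum_j |p_j - q_j|; total variation is half of this."""
    return np.sum(np.abs(p - q))

def f_divergence(p, q, f):
    """f-divergence: sum_j q_j * f(p_j / q_j), for convex f with f(1) = 0."""
    return np.sum(q * f(p / q))

def kl(p, q):
    """KL divergence as the f-divergence with f(x) = x log x."""
    return f_divergence(p, q, lambda x: x * np.log(x))

def separation(p, q):
    """Separation distance: max_j (1 - p_j / q_j)."""
    return np.max(1.0 - p / q)

if __name__ == "__main__":
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    for name, d in [("l2^2", l2_squared), ("l1", l1), ("KL", kl), ("sep", separation)]:
        print(name, d(p, q))
```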
3. General Allocation via Optimistic Tracking

Before proceeding to the analysis of problem (1) for specific distance measures, we first study an abstract yet more stylized class of problems similar to (1), where the dependency of the objective (loss) functions on the distribution parameter versus the number of allocated samples can be explicitly decoupled. In particular, let us consider the problem in which the objective functions satisfy certain regularity conditions, which we define next.

Definition 1. We use $\mathcal{F}$ to denote the class of functions $\varphi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ satisfying the following properties: 1) $\varphi(\cdot, T)$ is concave and non-decreasing for all $T \in \mathbb{R}$, 2) $\varphi(c, \cdot)$ is convex and non-increasing for all $c \in \mathbb{R}$, and 3) $\varphi(c, \cdot)$ and $\varphi(\cdot, T)$ are differentiable for all $c, T \in (0, \infty)$.

We can now define an analog of the optimization problem (1) whose objective function belongs to $\mathcal{F}$:

$$\min_{T_1, \dots, T_K}\ \max_{i \in [K]}\ \varphi(c_i, T_i), \quad \text{s.t.} \quad \sum_{i=1}^{K} T_i = n, \tag{4}$$

where the parameters $(c_i)_{i=1}^K$ depend solely on the distance measure $D$ and the distributions $(P_i)_{i=1}^K$. Note that in this setting, the budget allocation reduces to balancing the value of the objective function by tracking the distribution-dependent parameters $(c_i)_{i=1}^K$ (to be estimated). We refer to the solution of (4) with full knowledge of $(c_i)_{i=1}^K$ as $(\tilde{T}_i^*)_{i=1}^K$, and to the corresponding allocation scheme as $\tilde{\mathcal{A}}^*$. Similar to (1), when the parameters $(c_i)_{i=1}^K$ are unknown, we refer to (4) as a general tracking problem.

Optimistic Tracking Algorithm. We now propose and analyze an adaptive sampling scheme, motivated by the upper-confidence bound (UCB) algorithm (Auer et al., 2002) in multi-armed bandits, for solving the general tracking problem (4). The proposed scheme, whose pseudo-code is shown in Algorithm 1, samples optimistically by plugging high-probability upper bounds of $c_i$ into the objective function $\varphi$. Formally, for each arm $i \in [K]$ and time $t \in [n]$, we denote by $T_{i,t}$ the number of times that arm $i$ has been pulled prior to time $t$. We define the $(1-\delta)$-probability (high-probability) event $E := \bigcap_{t \in [n]} \bigcap_{i \in [K]} \{|\hat{c}_{i,t} - c_i| \le e_{i,t}\}$, where $\hat{c}_{i,t}$ is the empirical estimate of $c_i$ and $e_{i,t}$ is the radius (half of the length) of its confidence interval at time $t$, computed using $T_{i,t}$ samples. We define the upper bound of $c_i$ at time $t$ as $u_{i,t} := \hat{c}_{i,t} + e_{i,t}$, with the convention that $u_{i,1} = +\infty$. In the rest of the paper, we use $\hat{P}_{i,t}$ and $\hat{p}_{ij,t}$ to represent the estimates of $P_i$ and $p_{ij}$ at time $t$, computed from $T_{i,t}$ i.i.d. samples.

Algorithm 1 Optimistic Tracking Algorithm
1: Input: $K$, $n$, $\delta$
2: Initialize $t \leftarrow 1$
3: while $t \le n$ do
4:   if $t \le K$ then
5:     $I_t = t$
6:   else
7:     $I_t = \arg\max_{i \in [K]} \varphi(u_{i,t}, T_{i,t})$
8:   end if
9:   Observe $X \sim P_{I_t}$; $t \leftarrow t + 1$
10:  Update $u_{i,t}$, $\forall i \in [K]$
11: end while

We now state a theorem that bounds the deviation of the allocation obtained by Algorithm 1 (our optimistic tracking algorithm), $(T_i)_{i=1}^K$, from the allocation $(\tilde{T}_i^*)_{i=1}^K$, i.e., the solution to (4) when the parameters $(c_i)_{i=1}^K$ are known. Before stating our main theorem, we define $g_i^* := \frac{\partial \varphi(c, \tilde{T}_i^*)}{\partial c}\big|_{c = c_i}$ and $h_i^* := -\frac{\partial \varphi(c_i, T)}{\partial T}\big|_{T = \tilde{T}_i^*}$.

Theorem 1. Define $A := \max_{i \in [K]} g_i^*$, $B := \max_{i \in [K]} h_i^*$, and $e_n := \max_{i \in [K]} e_i^*$, where $e_i^*$ is the radius of the confidence interval of arm $i$ after $\tilde{T}_i^*$ pulls. Then, under the event $E$, and assuming that $B > 0$ and $\tilde{T}_i^* > 1$ for all $i \in [K]$, we have
$$B\,\big|T_i - \tilde{T}_i^*\big| \le 2A(K-1)\,e_n, \quad \forall i \in [K].$$

The proof of Theorem 1, given in Appendix C, generalizes the arguments used in Carpentier et al. (2011, Thm. 1) to handle any objective function $\varphi \in \mathcal{F}$.
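The following is a minimal Python sketch of Algorithm 1, assuming the caller supplies the objective $\varphi$ from the class $\mathcal{F}$, a routine producing the upper confidence bounds $u_{i,t}$, and a sampler for each distribution. The distance-specific choices of $\varphi$ and the UCBs are the subject of Sec. 4; all names here are illustrative.

```python
import numpy as np

def optimistic_tracking(K, n, phi, ucb, sample):
    """Sketch of Algorithm 1 (optimistic tracking).

    phi(u, T)      : objective from the class F, evaluated at a UCB u and pull count T
    ucb(i, obs, t) : upper confidence bound u_{i,t} for c_i from the observations of arm i
    sample(i)      : draws one observation (a symbol index in [l]) from distribution P_i
    """
    T = np.zeros(K, dtype=int)          # T_{i,t}: pulls of each arm so far
    obs = [[] for _ in range(K)]        # raw observations per arm
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1                   # pull each arm once for initialization
        else:
            # optimism: plug the UCB of c_i into phi and pull the arm with the largest value
            i = int(np.argmax([phi(ucb(k, obs[k], t), T[k]) for k in range(K)]))
        obs[i].append(sample(i))
        T[i] += 1
    return T
```

For instance, for the $\ell_2^2$ case of Sec. 4.2 one would pass `phi = lambda u, T: u / max(T, 1)` together with the upper confidence bound of Lemma 2.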
The idea behind the preceding discussion is the following: in cases where the objective function $\gamma_i$ in (1) lies in $\mathcal{F}$, we can use the result of Theorem 1 to obtain a bound on the deviation of the allocation of the resulting Algorithm 1 from the oracle allocation. In other cases, we can select an appropriate approximation of $\gamma_i$ within the function class $\mathcal{F}$, and then use Theorem 1 in conjunction with a regret decomposition result (Lemma 1) to obtain the required regret bounds.

4. Adaptive Allocation Algorithms

Algorithm 1, along with the corresponding Theorem 1, provides us with a road-map to design adaptive sampling algorithms for the tracking problem (1) for different choices of the distribution distance $D$.

4.1. Road-map

We proceed in the following steps:
Step 1: If $\gamma_i := \mathbb{E}[D(\hat{P}_i, P_i)] \notin \mathcal{F}$ (Def. 1), then derive an approximation of $\gamma_i(\cdot)$, denoted by $\varphi(c_i, \cdot)$, lying in $\mathcal{F}$. If $\gamma_i \in \mathcal{F}$, then set $\varphi(c_i, \cdot) = \gamma_i(\cdot)$.
Step 2: Construct an appropriate UCB for the parameter $c_i$, for $i \in [K]$, to instantiate Algorithm 1, and use Theorem 1 to get a bound on the deviation of the allocation of Algorithm 1 from the optimal one.
Step 3: Derive an upper bound on the regret by employing the decomposition given in Lemma 1 below, along with some distance-specific analysis.

In the sequel, we shall refer to $(\tilde{T}_i^*)_{i=1}^K$, the optimal solutions to (4), as the approx-oracle allocation, and to the corresponding (non-adaptive) strategy, $\tilde{\mathcal{A}}^*$, as the approx-oracle allocation rule. We now present the key regret decomposition result that will be used in deriving the regret bounds for the cases where $\gamma_i \notin \mathcal{F}$ and an approximation $\varphi(c_i, \cdot)$ is used in Algorithm 1.

Lemma 1. For any allocation scheme $\mathcal{A}$, a distance measure $D$, and sampling budget $n$, define $\tilde{R}_n(\mathcal{A}, D) = L_n(\mathcal{A}, D) - L_n(\tilde{\mathcal{A}}^*, D)$ and $R_i(T) := |\gamma_i(T) - \varphi(c_i, T)|$ for any $T > 0$. Then, assuming $\gamma_i$ is non-increasing for all $i \in [K]$, we have
$$R_n(\mathcal{A}, D) \le \tilde{R}_n(\mathcal{A}, D) + 3\max_{i \in [K]} R_i(\tilde{T}_i^*).$$

This result says that if an approximate objective function $\varphi(c_i, \cdot)$ is used, then the regret $R_n(\mathcal{A}, D)$ of an allocation scheme $\mathcal{A}$ can be decomposed into its tracking regret $\tilde{R}_n(\mathcal{A}, D)$ and the maximum approximation error between $\gamma_i$ and $\varphi(c_i, \cdot)$ computed at $(\tilde{T}_i^*)_{i=1}^K$. Lemma 1 is proved in Appendix B. The key step in the proof is bounding the quantity $|\varphi(c_i, \tilde{T}_i^*) - \gamma_i(T_i^*)|$ by $2R_i(\tilde{T}_i^*)$.

4.2. Adaptive Allocation for $\ell_2^2$-Distance

The squared $\ell_2$-distance between two distributions $P$ and $Q$ is defined as $D_{\ell_2}(P, Q) := \sum_{j=1}^{l} (p_j - q_j)^2$. In this case, we can compute the objective function of (1) in closed form as
$$\gamma_i(T_i) = \mathbb{E}\big[D_{\ell_2}(\hat{P}_i, P_i)\big] = \mathbb{E}\Big[\sum_{j=1}^{l} (\hat{p}_{ij} - p_{ij})^2\Big] = \frac{\sum_{j=1}^{l} p_{ij}(1 - p_{ij})}{T_i} =: \frac{c_i^{(\ell_2)}}{T_i} =: \varphi(c_i^{(\ell_2)}, T_i).$$
Note that the function $\gamma_i(T_i) = \varphi(c_i^{(\ell_2)}, T_i) = c_i^{(\ell_2)}/T_i$ belongs to $\mathcal{F}$. The oracle allocation is obtained by equalizing $c_i^{(\ell_2)}/T_i$ for all $i \in [K]$, and can be written as $T_i^* = \tilde{T}_i^* = \big(c_i^{(\ell_2)} / \sum_{k=1}^{K} c_k^{(\ell_2)}\big)\, n =: \lambda_i^{(\ell_2)} n$. Next, we present a result on the deviation between $c_i^{(\ell_2)}$ and its empirical version $\hat{c}_{i,t}^{(\ell_2)} = 1 - \sum_{j=1}^{l} \hat{p}_{ij,t}^2$.

Lemma 2. Define $\delta_t := 6\delta/(Kl\pi^2 t^2)$, $e_{i,t}^{(\ell_2)} := \sqrt{(l+2)^2 \log(1/\delta_t)/(2T_{i,t})}$, and the event $E_1 = \bigcap_{t \in [n]} \bigcap_{i \in [K]} \{|c_i^{(\ell_2)} - \hat{c}_{i,t}^{(\ell_2)}| \le e_{i,t}^{(\ell_2)}\}$. Then we have $\mathbb{P}(E_1) \ge 1 - \delta$.

Using Lemma 2 (proved in Appendix D), we can define the required UCB for $c_i^{(\ell_2)}$ as $u_{i,t}^{(\ell_2)} := \hat{c}_{i,t}^{(\ell_2)} + e_{i,t}^{(\ell_2)}$, to be plugged into Algorithm 1 to obtain an adaptive sampling scheme for $D_{\ell_2}$, which we shall refer to as $\mathcal{A}_{\ell_2}$.
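As an illustration, the plug-in quantities for $\mathcal{A}_{\ell_2}$ could be computed as in the sketch below; the confidence radius follows the expression in Lemma 2 as stated above, so constants should be checked against the paper's appendix, and the function name is hypothetical.

```python
import numpy as np

def l2_plugin(obs_i, l, t, delta, K):
    """Empirical parameter and confidence radius for the l2^2 scheme (cf. Lemma 2).

    obs_i : list of observed symbol indices for arm i.
    Returns (c_hat, e); the UCB u_{i,t} = c_hat + e is used in Algorithm 1
    via phi(u, T) = u / T.
    """
    T_it = len(obs_i)
    if T_it == 0:
        return 0.0, np.inf                        # u_{i,1} = +infinity by convention
    p_hat = np.bincount(obs_i, minlength=l) / T_it
    c_hat = 1.0 - np.sum(p_hat ** 2)              # c_hat = 1 - sum_j p_hat_j^2
    delta_t = 6 * delta / (K * l * np.pi ** 2 * t ** 2)
    e = np.sqrt((l + 2) ** 2 * np.log(1 / delta_t) / (2 * T_it))
    return c_hat, e
```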
We can now state the bound on the regret incurred by the allocation scheme $\mathcal{A}_{\ell_2}$ (proof in Appendix D).

Theorem 2. If we implement the algorithm $\mathcal{A}_{\ell_2}$ with a budget $n$ and $\delta = n^{-5/2}$, then for $n$ large enough, and with $E_{\ell_2} := \max_{1 \le i \le K} |T_i - \tilde{T}_i^*|$, we have $E_{\ell_2} = O(\sqrt{n})$ and $R_n(\mathcal{A}_{\ell_2}, D_{\ell_2}) = O(n^{-3/2})$.

The precise meaning of the term "$n$ large enough", as well as the exact expressions for the hidden constants in the above bounds, are provided in Appendix D. Note that the $O(n^{-3/2})$ convergence rate of Thm. 2 recovers the rate derived in Carpentier et al. (2011) for the special case where the $(P_i)_{i=1}^K$ are Bernoulli. Finally, note that in contrast to the adaptive scheme, which achieves a regret of $O(n^{-3/2})$, employing the uniform sampling scheme results in a regret of $\max_i c_i^{(\ell_2)} K/n - \sum_{i=1}^{K} c_i^{(\ell_2)}/n = O(1/n)$.

4.3. Adaptive Allocation for $\ell_1$-Distance

The $\ell_1$-distance between two distributions $P$ and $Q$ is defined as $D_{\ell_1}(P, Q) := \sum_{j=1}^{l} |p_j - q_j|$. Note that the total-variation distance $D_{TV}$ is related to $D_{\ell_1}$ as $D_{TV} = \frac{1}{2} D_{\ell_1}$. In this case, the objective function $\gamma_i$ can be obtained in closed form using the expression for the mean absolute deviation $\mathbb{E}[|\hat{p}_{ij} - p_{ij}|]$ given in Diaconis & Zabell (1991, Eq. 1.1). However, since this expression does not belong to $\mathcal{F}$, we first obtain an approximation of $\gamma_i$ in $\mathcal{F}$ as

$$\gamma_i(T_i) = \mathbb{E}\big[D_{\ell_1}(\hat{P}_i, P_i)\big] = \mathbb{E}\Big[\sum_{j=1}^{l} |\hat{p}_{ij} - p_{ij}|\Big] \overset{(a)}{\le} \sum_{j=1}^{l} \sqrt{\mathbb{E}\big[(\hat{p}_{ij} - p_{ij})^2\big]} = \frac{1}{\sqrt{T_i}} \sum_{j=1}^{l} \sqrt{p_{ij}(1-p_{ij})} =: \frac{c_i^{(\ell_1)}}{\sqrt{T_i}} =: \varphi(c_i^{(\ell_1)}, T_i), \tag{5}$$

where (a) follows from Jensen's inequality and the concavity of the square-root function. We can check that the approximate objective function $\varphi(c_i^{(\ell_1)}, T_i) = c_i^{(\ell_1)}/\sqrt{T_i}$, with $c_i^{(\ell_1)} = \sum_{j=1}^{l} \sqrt{p_{ij}(1-p_{ij})}$, lies in $\mathcal{F}$. The approx-oracle allocation is given by $\tilde{T}_i^* = \big(c_i^{(\ell_1)}\big)^2/C_{\ell_1}^2\, n =: \lambda_i^{(\ell_1)} n$, where $C_{\ell_1}^2 = \sum_{i=1}^{K} (c_i^{(\ell_1)})^2$. In order to obtain the adaptive allocation scheme for the $\ell_1$-distance, which we shall refer to as $\mathcal{A}_{\ell_1}$, we now derive high-probability upper bounds on $(c_i^{(\ell_1)})_{i=1}^K$ and then plug them into Algorithm 1.

Lemma 3. Define $\delta_t := 3\delta/(Kl\pi^2 t^2)$, $e_{ij,t}^{(\ell_1)} := \sqrt{2\log(2/\delta_t)/T_{i,t}}$, and the event $E_2 := \bigcap_{t \in [n]} \bigcap_{i \in [K]} \bigcap_{j \in [l]} \big\{\big|\sqrt{\hat{p}_{ij,t}(1-\hat{p}_{ij,t})} - \sqrt{p_{ij}(1-p_{ij})}\big| \le e_{ij,t}^{(\ell_1)}\big\}$. Then, we have $\mathbb{P}(E_2) \ge 1 - \delta$.

The proof (details in Appendix E.1) relies on an application of a concentration inequality for the standard deviation of random variables derived in Maurer & Pontil (2009, Thm. 10), followed by two union bounds. Lemma 3 allows us to define high-probability upper bounds on the parameters $c_i^{(\ell_1)}$ as $u_{i,t}^{(\ell_1)} := \hat{c}_{i,t}^{(\ell_1)} + e_{i,t}^{(\ell_1)}$, where $e_{i,t}^{(\ell_1)} = \sum_{j=1}^{l} e_{ij,t}^{(\ell_1)} = \sqrt{2l^2 \log(2/\delta_t)/T_{i,t}}$. We now state the regret bound for the adaptive allocation scheme $\mathcal{A}_{\ell_1}$ (proof in Appendix E.2).

Theorem 3. If we implement the algorithm $\mathcal{A}_{\ell_1}$ with budget $n$ and $\delta = 1/n$, then for $n$ large enough, and with $E_{\ell_1} := \max_{1 \le i \le K} |T_i - \tilde{T}_i^*|$, we have $E_{\ell_1} = O(\sqrt{n})$ and $R_n(\mathcal{A}_{\ell_1}, D_{\ell_1}) = O(n^{-3/4})$.

The exact expressions for the hidden constants in the above bounds are derived in Appendix E.2. As a reference, we note that using the uniform allocation for the $\ell_1$ loss would result in a regret of $O(n^{-1/2})$, which is larger than the $O(n^{-3/4})$ regret achieved by the adaptive scheme.
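Analogously to the $\ell_2^2$ case, a sketch of the plug-in quantities for $\mathcal{A}_{\ell_1}$ is given below (illustrative names; the radius follows Lemma 3 as stated above).

```python
import numpy as np

def l1_plugin(obs_i, l, t, delta, K):
    """Empirical parameter and confidence radius for the l1 scheme (cf. Lemma 3).

    c_hat estimates c_i^{(l1)} = sum_j sqrt(p_ij (1 - p_ij)); the UCB
    u = c_hat + e is used in Algorithm 1 with phi(u, T) = u / sqrt(T).
    """
    T_it = len(obs_i)
    if T_it == 0:
        return 0.0, np.inf
    p_hat = np.bincount(obs_i, minlength=l) / T_it
    c_hat = np.sum(np.sqrt(p_hat * (1.0 - p_hat)))
    delta_t = 3 * delta / (K * l * np.pi ** 2 * t ** 2)
    e = np.sqrt(2 * l ** 2 * np.log(2 / delta_t) / T_it)   # e_{i,t} = sum_j e_{ij,t}
    return c_hat, e
```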
4.4. Adaptive Allocation for $f$-Divergence

For a convex function $f : \mathbb{R} \to \mathbb{R}$ satisfying $f(1) = 0$, the $f$-divergence between two distributions $P$ and $Q$ is defined as $D_f(P, Q) := \sum_{j=1}^{l} q_j f(p_j/q_j)$. Since we cannot obtain a closed-form expression for the objective function $\gamma_i$ of an $f$-divergence, we proceed by writing $D_f(\hat{P}_i, P_i) = D_f^{(r)}(\hat{P}_i, P_i) + R_{i,r+1}$, where $D_f^{(r)}(\hat{P}_i, P_i)$ is the $r$-term Taylor approximation of $D_f(\hat{P}_i, P_i)$, i.e.,

$$D_f^{(r)}(\hat{P}_i, P_i) := \sum_{j=1}^{l} \sum_{m=1}^{r} \frac{f^{(m)}(1)}{m!}\, \frac{(\hat{p}_{ij} - p_{ij})^m}{p_{ij}^{m-1}}, \tag{6}$$

and $R_{i,r+1} = \sum_{j=1}^{l} R_{ij,r+1}$ is its remainder term (assuming $f$ is analytic at 1), i.e.,

$$R_{ij,r+1} := \sum_{m=r+1}^{\infty} \frac{f^{(m)}(1)}{m!}\, \frac{(\hat{p}_{ij} - p_{ij})^m}{p_{ij}^{m-1}}. \tag{7}$$

Note that in (6) and (7), $f^{(m)}(\cdot)$ is the $m$th derivative of $f$. We now define the approximate objective function for an $f$-divergence as

$$\varphi(c_i, T_i) := \mathbb{E}\big[D_f^{(r)}(\hat{P}_i, P_i)\big] = \sum_{j=1}^{l} \sum_{m=1}^{r} \frac{f^{(m)}(1)}{m!}\, \frac{1}{p_{ij}^{m-1} T_i^m}\, \mathbb{E}\Big[\Big(\sum_{s=1}^{T_i} \bar{Z}_{ij}^{(s)}\Big)^m\Big]. \tag{8}$$

Note that the exact value of the parameter $c_i$ above depends on the values of the terms $f^{(m)}(1)$, for $1 \le m \le r$. Next, we present a general result on the quality of the approximation of $\gamma_i$ by $\varphi(c_i, T_i)$ under the following two assumptions on $f$: (f1) $f(x)$ is real-analytic at the point $x = 1$, and (f2) $f^{(m)}(1)/m! \le C_1 < \infty$ for all $m \in \mathbb{N}$. Both assumptions are satisfied by several commonly used $f$-divergences, namely the KL-divergence with $f(x) = x\log x$, the $\chi^2$-divergence with $f(x) = (x-1)^2$, and the Hellinger distance with $f(x) = 2(1 - \sqrt{x})$.

Lemma 4. Assume that $f$ satisfies (f1) and (f2). Then there exists a constant $C_{f,r+1} < \infty$, whose exact definition is given by Eqs. 23, 24, and 28 in Appendix F.2, such that the following holds: $\mathbb{E}[R_{ij,r+1}] \le C_{f,r+1}\,(p_{ij} T_i)^{-(r+1)/2}$.

To proceed further according to the road-map of Section 4.1, we need to construct an upper bound on the parameter $c_i$ that depends on the specific choice of the function $f$. We carry out these derivations for $f(x) = x\log x$ (i.e., the KL-divergence) in Section 4.4.1 below.

Remark 1. An alternative approach is to proceed by assuming that there exist $\tau_{0,i}$ and $\tau_{1,i}$ such that, with probability at least $1-\delta$, we have $\tau_{0,i} \le T_i \le \tau_{1,i}$ for $i \in [K]$ (analogous to the statement of Thm. 1). Under this assumption, we can obtain a very general regret decomposition for arbitrary $f$-divergence distance measures satisfying (f1) and (f2). The details of this approach are given in Appendix F.1; in particular, the formal statement of the regret decomposition is Thm. 7 in Appendix F.1.

4.4.1. Adaptive Allocation for KL-Divergence

The KL-divergence between distributions $P$ and $Q$ is defined as $D_{KL}(P, Q) := \sum_{j=1}^{l} p_j \log(p_j/q_j)$. We begin by deriving its $r$-term approximation with $r = 5$, i.e.,

$$\mathbb{E}\big[D_{KL}(\hat{P}_i, P_i)\big] = \mathbb{E}\Big[\sum_{j=1}^{l} \hat{p}_{ij}\log(\hat{p}_{ij}/p_{ij})\Big] \overset{(a)}{=} \frac{l-1}{2T_i} + \frac{1}{12T_i^2}\Big(\sum_{j=1}^{l} \frac{1}{p_{ij}} - 1\Big) + O(1/T_i^3), \tag{9}$$

where (a) follows by calculating the 5th-order Taylor approximation of the mapping $x \mapsto x\log(x)$. The calculations involved in this derivation are described in Harris (1975, Sec. 2). The choice of $r = 5$ is sufficient, as it is the smallest $r$ for which the approximation error, which is of $O(n^{-3})$ according to Lemma 4, is smaller than the tracking regret, which, as we show in the proof of Thm. 4, is of $O(n^{-5/2})$. Eq. (9) gives us the approximate objective function $\varphi(c_i^{(KL)}, T_i) := \frac{l-1}{2T_i} + \frac{c_i^{(KL)}}{T_i^2}$, with $c_i^{(KL)} := \big(\sum_{j=1}^{l} 1/p_{ij} - 1\big)/12$. Note that this $\varphi(c_i^{(KL)}, T_i)$ belongs to the class of functions $\mathcal{F}$ introduced in Definition 1. Deriving the approx-oracle allocation $(\tilde{T}_i^*)_{i=1}^K$ requires solving a cubic equation. Instead of computing the exact form of $\tilde{T}_i^*$, we show in Lemma 5 that the deviation of $\tilde{T}_i^*$ from the uniform allocation is bounded by a problem-dependent constant, implying that the uniform allocation is near-optimal. This is not surprising, as the first-order approximation of $\varphi$ (the first term on the RHS of Eq. 9) does not change with $P_i$.

Lemma 5. For $(P_i)_{i=1}^K \in \Delta_l^{(\eta)}$ and $(\tilde{T}_i^*)_{i=1}^K$ denoting the approx-oracle allocation, we have
$$\big|\tilde{T}_i^* - T_0\big| \le \frac{K\big(c_{\max}^{(KL)} - c_{\min}^{(KL)}\big)}{l-1}, \quad \forall i \in [K],$$
where $T_0 = n/K$ is the uniform allocation, and $c_{\min}^{(KL)}$ and $c_{\max}^{(KL)}$ denote the minimum and maximum values of $c_i^{(KL)}$, respectively.
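As a quick numerical illustration of Lemma 5, the following sketch (illustrative names, based on the approximate objective in (9)) equalizes $\varphi(c_i^{(KL)}, T_i)$ across arms by bisection on the common objective value and shows that the resulting allocation stays close to the uniform one.

```python
import numpy as np

def c_kl(p):
    """c_i^{(KL)} = (sum_j 1/p_ij - 1) / 12."""
    return (np.sum(1.0 / p) - 1.0) / 12.0

def approx_oracle_allocation(cs, n, l, iters=60):
    """Equalize phi(c_i, T_i) = (l-1)/(2 T_i) + c_i / T_i^2 over arms, s.t. sum_i T_i = n."""
    def T_of(mu):
        # solve mu*T^2 - ((l-1)/2)*T - c = 0 for T > 0, separately for each arm
        a, b = mu, -(l - 1) / 2.0
        return (-b + np.sqrt(b ** 2 + 4 * a * cs)) / (2 * a)
    lo, hi = 1e-12, 1.0
    while np.sum(T_of(hi)) > n:          # raise the common value until the budget is not exceeded
        hi *= 2
    for _ in range(iters):               # bisection on the common objective value
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if np.sum(T_of(mid)) < n else (mid, hi)
    return T_of(0.5 * (lo + hi))

if __name__ == "__main__":
    P = [np.array([0.4, 0.3, 0.3]), np.array([0.9, 0.05, 0.05])]
    cs = np.array([c_kl(p) for p in P])
    T = approx_oracle_allocation(cs, n=1000, l=3)
    print("approx-oracle:", T, " vs uniform:", [500, 500])   # deviation stays O(1) in n (Lemma 5)
```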
Next, with $e_{ij,t} := \sqrt{2\log(2/\delta_t)/T_{i,t}}$, we define the following upper bound for the parameters $c_i^{(KL)}$, $i \in [K]$:
$$u_{i,t}^{(KL)} = \begin{cases} \dfrac{1}{12}\Big(\sum_{j=1}^{l} \dfrac{1}{\hat{p}_{ij,t} - e_{ij,t}} - 1\Big) & \text{if } \hat{p}_{ij,t} \ge \dfrac{7e_{ij,t}}{2} \text{ for all } j \in [l], \\ +\infty & \text{otherwise.} \end{cases}$$
The deviation of this upper bound from the true parameter $c_i^{(KL)}$ can be computed by exploiting the convexity of the mapping $x \mapsto 1/x$ and the exact expression for the length of the confidence interval reported in Lemma 10 in Appendix G. These upper bounds can then be plugged into Algorithm 1 to obtain an adaptive allocation scheme for the KL-divergence, denoted by $\mathcal{A}_{KL}$. Finally, we state the regret bound for $\mathcal{A}_{KL}$ in the following theorem (proof in Appendix G).

Theorem 4. Let $(P_i)_{i=1}^K \in \Delta_l^{(\eta)}$ and suppose the adaptive scheme $\mathcal{A}_{KL}$ is implemented with $\delta = (3K/n)^6$. Then, for large enough $n$ and with $E_{KL} := \max_{i \in [K]} |T_i - \tilde{T}_i^*|$, we have $E_{KL} = O(n^{1/2})$ and $R_n(\mathcal{A}_{KL}, D_{KL}) = O(n^{-5/2})$.

As we showed in Lemma 5, the approx-oracle allocation for $D_{KL}$ is close, although not identical, to the uniform allocation. This is due to the fact that the first-order term in the approximation given in (9) depends only on the support size of the distributions, which is assumed to be the same for all $K$ distributions in our setting. Thus, the uniform allocation is the approx-oracle allocation for the first-order approximation of $D_{KL}$, and it achieves a regret upper bound of $O(n^{-2})$. However, as shown in Theorem 4 above, consideration of the higher-order terms allows us to achieve an $O(n^{-5/2})$ regret (see Eq. 45 in Appendix G for the exact expression).

4.5. Adaptive Allocation for Separation Distance

The separation distance (Gibbs & Su, 2002) between distributions $P$ and $Q$ is defined as $D_s(P, Q) := \max_{j \in [l]} (1 - p_j/q_j)$. We start by introducing some new notation. Given a probability distribution $P_i \in \Delta_l$ and a non-empty set $S \subset [l]$, we define $p_{i,S} := \sum_{j \in S} p_{ij}$. We also define the functions $\rho_1(p) := \sqrt{(1-p)/p}$ and $\rho_2(p) := \rho_1(p) + \rho_1(1-p)$, and introduce the terms $c_i^{(s)} := \sum_{j=1}^{l} \rho_1(p_{ij})$ and $\bar{c}_i^{(s)} := \max_{\emptyset \ne S \subset [l]} \rho_2(p_{i,S})$. Note that $c_i^{(s)} = \bar{c}_i^{(s)}$ for $l = 2$. Because of the max operation in the definition of $D_s$, in general we cannot obtain a closed-form expression for the objective function $\gamma_i(T_i) = \mathbb{E}[D_s(\hat{P}_i, P_i)]$. We now state a key lemma (proof in Appendix H) that provides an approximation of $\mathbb{E}[D_s(\hat{P}_i, P_i)]$.

Lemma 6. For a distribution $P_i \in \Delta_l$, let $\hat{P}_i = (\hat{p}_{ij})_{j=1}^l$ be the empirical distribution constructed from $T_i$ i.i.d. draws from $P_i$. Then, we have
$$\bar{c}_i^{(s)}\,\frac{1}{\sqrt{2\pi T_i}} - \frac{C_i^{(s)}}{T_i} \;\le\; \mathbb{E}\big[D_s(\hat{P}_i, P_i)\big] \;\le\; c_i^{(s)}\,\frac{1}{\sqrt{2\pi T_i}} + \frac{\bar{C}_i^{(s)}}{T_i},$$
where $C_i^{(s)}$ and $\bar{C}_i^{(s)}$ are $P_i$-dependent constants defined by (47) and (50) in Appendix H.

The proof of the upper bound in Lemma 6 proceeds by first upper bounding the term inside the expectation by a normalized sum of random variables, and then using a non-asymptotic version of the Central Limit Theorem (Ross, 2011, Thm. 3.2). For the lower bound, we first show (Lemma 12 in Appendix H) that bunching together probability masses can only reduce the separation distance between two distributions, and then proceed by another application of (Ross, 2011, Thm. 3.2). Lemma 6 gives us an interval that contains the true objective function we aim to track. To implement the adaptive scheme, we employ the approximate objective function $\varphi(c_i^{(s)}, T_i) := c_i^{(s)}\sqrt{1/(2\pi T_i)}$.
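The following is a small sketch of the two constants appearing in Lemma 6, under our reading that $c_i^{(s)}$ governs the upper bound and $\bar{c}_i^{(s)}$ the lower bound; for $l = 2$ the two coincide, as noted above (names illustrative).

```python
import itertools
import numpy as np

def rho1(p):
    """rho_1(p) = sqrt((1 - p) / p)."""
    return np.sqrt((1.0 - p) / p)

def rho2(p):
    """rho_2(p) = rho_1(p) + rho_1(1 - p)."""
    return rho1(p) + rho1(1.0 - p)

def c_sep(p):
    """c_i^{(s)} = sum_j rho_1(p_ij)."""
    return np.sum(rho1(p))

def c_sep_bar(p):
    """max over non-empty proper subsets S of [l] of rho_2(p_{i,S})."""
    l = len(p)
    best = 0.0
    for r in range(1, l):
        for S in itertools.combinations(range(l), r):
            best = max(best, rho2(np.sum(p[list(S)])))
    return best

if __name__ == "__main__":
    p2 = np.array([0.7, 0.3])          # l = 2: the two constants coincide
    p3 = np.array([0.5, 0.3, 0.2])     # l = 3: the two constants generally differ
    print(c_sep(p2), c_sep_bar(p2))
    print(c_sep(p3), c_sep_bar(p3))
```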
In order to instantiate Algorithm 1 for $D_s$, we need to derive high-probability confidence intervals for the terms $\sqrt{(1-p_{ij})/p_{ij}}$ appearing in the definition of $(c_i^{(s)})_{i=1}^K$. We use the event $E_1$ defined in Lemma 2 and prove the following result.

Lemma 7. Let $P_i \in \Delta_l^{(\eta)}$, and let the event $E_1$ and the terms $\delta_t$ and $e_{ij,t}$ be defined as in Lemma 2. Define the terms $a_{i,t} := \big(8\log(2/\delta_t)/T_{i,t}\big)^{1/4}$ and $b_{i,t} := \frac{l\,a_{i,t}}{\eta}\max\big\{1, \frac{a_{i,t}}{2\eta^{3/2}}\big\}$. Then, under the high-probability event $E_1$, we have
$$\sqrt{\frac{1}{\hat{p}_{ij,t} - e_{ij,t}} - 1} \;\le\; \sqrt{\frac{1}{p_{ij}} - 1} + b_{i,t}.$$

Using the concentration result of Lemma 7, we can now implement Algorithm 1 with the upper bound $u_{i,t}^{(s)} = \sum_{j=1}^{l} \sqrt{\frac{1}{\hat{p}_{ij,t} - e_{ij,t}} - 1}\,\big/\sqrt{2\pi}$ if $\hat{p}_{ij,t} \ge 7e_{ij,t}/2$ for all $j \in [l]$, and $u_{i,t}^{(s)} = +\infty$ otherwise. This gives us an adaptive allocation scheme for the separation distance, which we shall refer to as $\mathcal{A}_s$. Finally, we prove the following regret bound for $\mathcal{A}_s$ (proof in Appendix H.3).

Theorem 5. Let $P_i \in \Delta_l^{(\eta)}$ and suppose the adaptive scheme $\mathcal{A}_s$ is implemented with $\delta = \eta/n$. Then, for large enough $n$ and with $E_s := \max_{1 \le i \le K} |T_i - \tilde{T}_i^*|$, we have $E_s = O(\sqrt{n})$ and
$$R_n(\mathcal{A}_s, D_s) = O\Big(\max_{i \in [K]} \frac{c_i^{(s)} - \bar{c}_i^{(s)}}{\sqrt{n}} + \frac{1}{n^{3/4}}\Big). \tag{10}$$
The exact condition on $n$, and the expressions for $E_s$ and the higher-order terms in (10), are given in Appendix H.3.

Remark 2. Note that the first term on the RHS of (10) is the approximation-error term in the regret decomposition introduced in Lemma 1, while the second term is the tracking regret w.r.t. the approx-oracle allocation scheme. In general, the approximation error, which is $O(n^{-1/2})$, dominates the tracking-regret term, which is $O(n^{-3/4})$. However, for the special case of $l = 2$, the approximation-error term becomes $O(1/n)$, using the fact that $c_i^{(s)} = \bar{c}_i^{(s)}$ in Lemma 6, and we achieve an overall regret of $O(n^{-3/4})$.

5. Lower Bound

Theorem 1 provided a general high-probability bound on the deviation of the adaptive allocation $(T_i)_{i=1}^K$ from the approx-oracle allocation $(\tilde{T}_i^*)_{i=1}^K$. In Sec. 4, we observed that when specialized to the objective functions corresponding to $D_{\ell_2}$, $D_{\ell_1}$, and $D_s$, we have $|T_i - \tilde{T}_i^*| = O(\sqrt{n})$. A natural question to ask is whether there exists any other adaptive scheme that can achieve a smaller deviation from the approx-oracle allocation. We now show that this is not the case by deriving a lower bound on the expected deviation of any allocation scheme $\mathcal{A}$. To derive the lower bound, we consider a specific class of problems with two arms, $K = 2$, Bernoulli distributions, $l = 2$, and objective functions of the form $\varphi(c_i, T_i) = c_i/T_i^{\alpha}$, for some $\alpha > 0$. For some $p_0 \in (1/2, 1)$ and $\epsilon > 0$, we define two Bernoulli distributions $P_1 = \mathrm{Ber}(p_0)$ and $P_2 = \mathrm{Ber}(p_0 - \epsilon)$. We consider two problem instances $\mathcal{P}_1$ and $\mathcal{P}_2$ with $K = 2$ and distributions $P_1$ and $P_2$, but with their order swapped, i.e., $\mathcal{P}_1 = (P_1, P_2)$ and $\mathcal{P}_2 = (P_2, P_1)$. Finally, we introduce the notation $\kappa(p)$ to represent the distribution-dependent constant in the objective function $\varphi$ corresponding to a $\mathrm{Ber}(p)$ distribution. We now state the lower bound result.

Theorem 6. For some $p_0 \in (1/2, 3/4]$ and $0 < \epsilon < p_0 - 1/2$, consider the two tracking problems $\mathcal{P}_1 = (P_1, P_2)$ and $\mathcal{P}_2 = (P_2, P_1)$, with $P_1 = \mathrm{Ber}(p_0)$ and $P_2 = \mathrm{Ber}(p_0 - \epsilon)$, and objective function $\varphi(c, T) = c/T^{\alpha}$ for $\alpha > 0$, where the constant $c = \kappa(p)$ for a $\mathrm{Ber}(p)$ distribution. Finally, introduce the notation $\tau = \frac{n}{2}\cdot\frac{\big|\kappa(p_0)^{1/\alpha} - \kappa(p_0-\epsilon)^{1/\alpha}\big|}{\kappa(p_0)^{1/\alpha} + \kappa(p_0-\epsilon)^{1/\alpha}}$. If $(T_i)_{i=1}^2$ denotes the allocation of any allocation scheme $\mathcal{A}$, we have
$$\max_{\mathcal{P}_1, \mathcal{P}_2}\ \max_{i=1,2}\ \mathbb{E}\big[|T_i - \tilde{T}_i^*|\big] \;\ge\; \sup_{0<\epsilon}$$