Adaptive Sampling for Estimating Probability Distributions

Shubhanshu Shekhar (ECE Department, University of California, San Diego), Tara Javidi (ECE Department, University of California, San Diego), Mohammad Ghavamzadeh (Google Research)

Abstract

We consider the problem of allocating a fixed budget of samples to a finite set of discrete distributions to learn them uniformly well (minimizing the maximum error) in terms of four common distance measures: $\ell_2^2$, $\ell_1$, $f$-divergence, and separation distance. To present a unified treatment of these distances, we first propose a general optimistic tracking algorithm and analyze its sample-allocation performance with respect to an oracle. We then instantiate this algorithm for the four distance measures and derive bounds on their regret. We also show that the allocation performance of the proposed algorithm cannot, in general, be improved, by deriving lower bounds on the expected deviation from the oracle allocation for any adaptive scheme. We verify our theoretical findings through some experiments. Finally, we show that the techniques developed in the paper can be easily extended to learning some classes of continuous distributions, as well as to the related setting of minimizing the average error (rather than the maximum error) in learning a set of distributions.

1. Introduction

Consider the problem in which a learner must allocate $n$ samples among $K$ discrete distributions to construct uniformly good (minimizing the maximum error) estimates of these distributions in terms of a distance measure $D$. Depending on $D$, certain distributions may require far fewer samples than others to be estimated with the same precision. The optimal sampling strategy for a given $n$ requires knowledge of the true distributions. The goal of this paper is to design adaptive allocation strategies that converge to the optimal strategy while remaining oblivious to the true distributions.

The problem described above models several applications that are not captured by existing works. Here, we describe some such applications. (1) Opinion polling: Suppose there are $K$ groups of voters who have different preference distributions over $l$ candidates in an election. While some groups might heavily favour a single candidate, others might be indifferent, resulting in a more uniform distribution over the set of candidates. In this setting, how should the polling agency allocate its sampling budget? Intuitively, more samples should be allocated to the indifferent voter groups in order to construct uniformly good estimates of their preference distributions. (2) Compression of text files: Given a sampling budget of $n$ bytes, consider the problem of designing codes with minimum average length for text files in $K$ different languages. Since different languages may have different symbol frequencies, this can be formulated as learning $K$ distributions uniformly well in terms of certain $f$-divergences. (3) Learning an MDP model: In many sequential decision-making problems, the agent's interaction with the environment is modeled as a Markov decision process (MDP). In these problems, it is often important to accurately estimate the dynamics (i.e., the transition structure of the MDP), given a finite exploration budget.
Learning the MDP model is equivalent to estimating $S \times A$ distributions, where $S$ and $A$ are the number of states and actions of the (finite) MDP. Therefore, assuming the existence of a known policy that can efficiently transition the MDP between any two states, the problem reduces to finding the optimal allocation of samples to these $S \times A$ distributions. Thus, the framework studied in this paper provides a first step towards solving the general problem of constructing accurate models for MDPs. The requirement of a known policy to transition between states can be relaxed by employing the techniques recently developed for efficient exploration in MDPs (e.g., Tarbouriech & Lazaric 2019; Hazan et al. 2019; Cheung 2019), which we leave for future work.

Antos et al. (2008) were the first to study the problem of learning the means of $K$ distributions uniformly well, and proposed and analyzed an algorithm based on a forced-exploration strategy. Carpentier et al. (2011) proposed and analyzed an alternative approach for the same problem, based on the UCB algorithm (Auer et al., 2002). Carpentier & Munos (2011) analyzed an optimistic policy for the related problem of stratified sampling, where the goal is to learn $K$ distributions in terms of a weighted average distance (instead of the max). Soare et al. (2013) extended the optimistic strategy to the case of uniformly estimating $K$ linearly correlated distributions. Riquelme et al. (2017) applied the optimistic strategy to the problem of allocating covariates (drawn in an i.i.d. manner from some distribution) for uniform estimation of $K$ linear models. The prior work mentioned above has focused solely on estimating the means of distributions in the squared-error sense, and its analytic techniques do not extend to learning entire distributions. In this paper, we generalize the above-mentioned prior work by considering the problem of active sampling to uniformly learn $K$ distributions in terms of pre-specified distance measures on the space of probability distributions.

Overview of Results. Intuitively, the optimal allocation should equalize the expected distance between the true distribution and the resulting empirical estimate for all the $K$ distributions. This allocation, however, may have a complex dependence on the true distributions $P_i$, for $1 \le i \le K$. Our approach in this paper is to first identify an objective function which (i) is a good approximation of the true objective for a given distance measure $D$, (ii) depends on the original distribution $P_i$ only through a single real-valued parameter $c_i$, and (iii) has a decoupled dependence on $c_i$ and $T_i$. In Sec. 3, we formally define an appropriate function class $\mathcal{F}$ within which the objective functions for various distance measures should lie. We then propose a generic optimistic tracking strategy (Alg. 1) which addresses the trade-off between constructing better estimates of the parameters $c_i$ and using the existing estimates of $c_i$ to drive the allocation towards the optimal one. We also obtain a general bound on its deviation from an (approx-)oracle allocation (defined in Sec. 3). In Sec. 4, we first present a road-map for designing adaptive sampling schemes for arbitrary loss functions using the results of Sec. 3, and then specialize this to the case of four widely used distance measures: $\ell_2^2$, $\ell_1$, $f$-divergence, and separation distance. For each distance measure, we obtain bounds on the regret of the proposed sampling scheme w.r.t. an oracle strategy.
In Sec. 5, we derive matching lower bounds on the expected deviation from the oracle allocation for any algorithm. Experiments with synthetic examples in Sec. 6 validate our theoretical results. Finally, in Sec. 7 we discuss how our techniques can be extended to learning some classes of continuous distributions, as well as to the related problem of minimizing the average error.

Technical Contributions. The results of this paper require generalizing existing techniques, as well as introducing new methods. More specifically, the proof of Theorem 1 abstracts out the arguments of Carpentier et al. (2011, Thm. 1) to deal with a much larger class of objective functions. Prior work on mean-squared error (Antos et al., 2008; Carpentier et al., 2011) required bounding the first and second moments of random sums, which could be achieved by a direct application of Wald's equations (Durrett, 2019, Thm. 4.8.6). Our results on $f$-divergence (Thm. 7 and Lemma 9 in Appendix F) require analyzing higher moments of random sums, for which Wald's equations are not applicable. Deriving the approximate objective function for the separation distance involves estimating the expectation of the maximum of some correlated random variables. We obtain upper and lower bounds on this expectation in Lemma 6 by first approximating the maximum with certain sums, and then bounding the sums using a normal approximation result (Ross, 2011, Thm. 3.2).

2. Problem Setup

Consider $K$ discrete distributions, $(P_i)_{i=1}^K$, that belong to the $(l-1)$-dimensional probability simplex $\Delta_l$ and take values in the set $\mathcal{X} = \{x_1, \dots, x_l\}$. Each distribution $P_i$ is equivalently represented by a vector $P_i = (p_{i1}, \dots, p_{il})$ with $p_{ij} \ge 0$, $j \in [l]$, and $\sum_{j=1}^{l} p_{ij} = 1$. For any integer $b > 0$, we denote by $[b]$ the set $\{1, \dots, b\}$. Given a budget of $n \ge K$ samples, we consider the problem of allocating samples to each of the $K$ distributions in such a way that the maximum (over the $K$ distributions) discrepancy between the empirical distributions (estimated from the samples) and the true distributions is minimized. To formally define this problem, suppose an allocation scheme assigns $(T_i)_{i=1}^K$ samples to the $K$ distributions, such that $T_i \ge 0$, $i \in [K]$, and $\sum_{i=1}^{K} T_i = n$. Also suppose that $\hat{P}_i$ is the empirical distribution with $\hat{p}_{ij} = T_{ij}/T_i$, where $T_{ij}$ denotes the number of times the outcome $x_j$ was observed in the $T_i$ draws from $P_i$, and $D : \Delta_l \times \Delta_l \to [0, \infty)$ is a distribution distance measure. Then, our problem of interest can be defined as finding an allocation scheme $(T_i)_{i=1}^K$ that solves the following constrained optimization problem:

$$\min_{T_1, \dots, T_K}\ \max_{i \in [K]}\ \mathbb{E}\big[D(\hat{P}_i, P_i)\big], \quad \text{s.t.} \quad \sum_{i=1}^{K} T_i = n. \tag{1}$$

We refer to the (non-integer) solution of (1) with full knowledge of $(P_i)_{i=1}^K$ as the oracle allocation $(T_i^*)_{i=1}^K$. It is important to note that $(T_i^*)_{i=1}^K$ ensure that the objective functions $\gamma_i(T_i) := \mathbb{E}[D(\hat{P}_i, P_i)]$ are equal for all $i \in [K]$. However, in practice, $(P_i)_{i=1}^K$ are not known. In this case, we refer to (1) as a tracking problem, in which the goal is to design adaptive sampling strategies that approximate the oracle allocation using running estimates of $(P_i)_{i=1}^K$.
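To make the objective in (1) concrete, the following is a minimal Monte-Carlo sketch (not part of the paper) that estimates $\gamma_i(T_i) = \mathbb{E}[D(\hat{P}_i, P_i)]$ and the resulting max-risk for a candidate allocation, using the squared $\ell_2$ distance as a stand-in for $D$; all function names are illustrative.

```python
import numpy as np

def empirical_distribution(p, T, rng):
    """Draw T samples from the discrete distribution p and return the empirical estimate."""
    counts = rng.multinomial(T, p)
    return counts / T

def expected_distance(p, T, dist, num_mc=2000, rng=None):
    """Monte-Carlo estimate of gamma_i(T) = E[ D(P_hat, P) ] for one distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    return np.mean([dist(empirical_distribution(p, T, rng), p) for _ in range(num_mc)])

def max_risk(P, allocation, dist):
    """Objective of problem (1): worst-case expected distance over the K distributions."""
    return max(expected_distance(p, T, dist) for p, T in zip(P, allocation))

if __name__ == "__main__":
    l2_sq = lambda p_hat, p: np.sum((p_hat - p) ** 2)          # squared l2 distance
    P = [np.array([0.8, 0.1, 0.1]), np.array([1/3, 1/3, 1/3])]  # K = 2, l = 3
    n = 600
    print("uniform allocation:", max_risk(P, [n // 2, n // 2], l2_sq))
    print("skewed allocation :", max_risk(P, [n // 3, 2 * n // 3], l2_sq))
```

Running the sketch illustrates the intuition from the introduction: the near-uniform distribution needs more samples than the peaked one to reach the same expected error.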
Choice of the Distance Measure. It is expected that the optimal allocation will strongly depend on the distance measure $D$. In this paper, we study four distances: $\ell_2^2$, $\ell_1$ or total variation (TV), $f$-divergence, and separation distance. These distances include all those in (Gibbs & Su, 2002) that do not require a metric structure on $\mathcal{X}$. The $f$-divergence family generalizes the well-known KL-divergence ($D_{KL}$) and includes a number of other common distances, such as total variation ($D_{TV}$), Hellinger ($D_H$), and chi-square ($D_{\chi^2}$). Applications of $f$-divergences include source and channel coding problems (Csiszár, 1967; 1995), testing goodness-of-fit (Györfi et al., 2000), and distribution estimation (Barron et al., 1992). The common $f$-divergences mentioned above satisfy the chain of inequalities $D_{TV} \le D_H \le \sqrt{D_{KL}} \le \sqrt{D_{\chi^2}}$, which defines a hierarchy of convergence among these measures (Tsybakov, 2009, Eq. 2.27). The separation distance $D_s(P, Q)$ (defined formally in Sec. 4.5) arises naturally in the study of the convergence of symmetric Markov chains to their stationary distribution. More specifically, if $Q$ is the stationary distribution of a Markov chain and $(P_t)_{t \ge 1}$ is its state distribution at time $t$, such that $Q = P_T$ at a random time $T$, then $D_s(P_t, Q) \le \mathbb{P}(T > t)$ (Aldous & Diaconis, 1987, Sec. 3).

Choice of Estimator. In this work, we fix the estimated distribution $\hat{P}_i$ to be the empirical distribution, i.e., $\hat{P}_i = [\hat{p}_{ij}]_{j=1}^l$, where $\hat{p}_{ij} = T_{ij}/T_i$. While the empirical estimator is known to be suboptimal in a min-max sense (Kamath et al., 2015), the additional error due to the deviation of $\mathbb{E}[D(\hat{P}_i, P_i)]$ from the min-max value, for some of the above distances ($\ell_2^2$, $\ell_1$, and $f$-divergence), does not change the final regret obtained. For instance, for the $\ell_2^2$ distance, the results of Kamath et al. (2015) show that $\mathbb{E}[D_{\ell_2^2}(\hat{P}_i, P_i)]$ differs from the min-max value by an $O(n^{-3/2})$ term. Since this term is of the same order as the regret we derive in Theorem 2, we conclude that for this loss the regret cannot be improved by using the min-max optimal estimator. Similar results can be shown for the $\ell_1$ distance and the $f$-divergence family as well.

Allocation Scheme and Regret. An adaptive allocation scheme $\mathcal{A}$ consists of a sequence of mappings $(\pi_t)_{t \ge 1}$, where each mapping $\pi_t : \mathbb{N} \times (\mathcal{X} \times [K])^{t-1} \to [K]$ selects an arm to pull at time $t$, based on the budget $n$ and the history of pulls and observations up to time $t$. (Each distribution can be considered as an arm, and thus we use the terms "sampling from a distribution" and "pulling an arm" interchangeably throughout the paper.) For an allocation scheme $\mathcal{A}$, a sampling budget $n$, and a distance measure $D$, we define the risk incurred by $\mathcal{A}$ as

$$L_n(\mathcal{A}, D) = \max_{i \in [K]} \mathbb{E}\big[D(\hat{P}_i, P_i)\big]. \tag{2}$$

We denote by $\mathcal{A}^*$ the oracle allocation rule. The performance of an allocation scheme $\mathcal{A}$ is measured by its suboptimality, or regret, w.r.t. $\mathcal{A}^*$, i.e.,

$$R_n(\mathcal{A}, D) := L_n(\mathcal{A}, D) - L_n(\mathcal{A}^*, D). \tag{3}$$

Notations. (See Table A.1 in App. A for a list of all the notations used.) For $0 < \eta < 1/2$, we define the $\eta$-interior of the $(l-1)$-dimensional simplex $\Delta_l$ as $\Delta_l^{(\eta)} := \{P \in \Delta_l \mid \eta \le p_j \le 1-\eta,\ \forall j \in [l]\}$. We use the Bernoulli random variable $Z_{ij}^{(s)}$ to represent the indicator that the $s$th draw from arm $i$ is equal to $x_j \in \mathcal{X}$. Note that for any draw $s$, we have $\mathbb{E}[Z_{ij}^{(s)}] = p_{ij}$. For any $t \in [n]$, we define $W_{ij,t} = \sum_{s=1}^t \bar{Z}_{ij}^{(s)}$, where $\bar{Z}_{ij}^{(s)} := Z_{ij}^{(s)} - p_{ij}$ is a centered Bernoulli variable. We also note that several terms, such as $\varphi$, $A$, $B$, and $e_n$ (to be introduced in Sec. 3), are overloaded for different distance measures. For instance, we use $\varphi$ for both $\ell_1$ and KL-divergence, instead of writing $\varphi^{(\ell_1)}$ and $\varphi^{(KL)}$. The meaning should be clear from the local context.
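For reference, the four distance measures discussed above can be written down directly. The following sketch (with illustrative function names, and assuming strictly positive probability vectors) mirrors the definitions used in this paper; the formal treatment is in Sec. 4.

```python
import numpy as np

def l2_squared(p, q):
    """Squared l2 distance: sum_j (p_j - q_j)^2."""
    return np.sum((p - q) ** 2)

def l1(p, q):
    """l1 distance: sum_j |p_j - q_j|; total variation is half of this."""
    return np.sum(np.abs(p - q))

def f_divergence(p, q, f):
    """f-divergence: sum_j q_j * f(p_j / q_j), for convex f with f(1) = 0."""
    return np.sum(q * f(p / q))

def kl(p, q):
    """KL divergence as the f-divergence with f(x) = x log x."""
    return f_divergence(p, q, lambda x: x * np.log(x))

def separation(p, q):
    """Separation distance: max_j (1 - p_j / q_j)."""
    return np.max(1.0 - p / q)

if __name__ == "__main__":
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    for name, d in [("l2^2", l2_squared), ("l1", l1), ("KL", kl), ("sep", separation)]:
        print(name, d(p, q))
```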
3. General Allocation via Optimistic Tracking

Before proceeding to the analysis of problem (1) for specific distance measures, we first study an abstract yet more stylized class of problems similar to (1), where the dependency of the objective (loss) functions on the distribution parameter versus the number of allocated samples can be explicitly decoupled. In particular, let us consider the problem in which the objective functions satisfy certain regularity conditions, which we define next.

Definition 1. We use $\mathcal{F}$ to denote the class of functions $\varphi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ satisfying the following properties: 1) $\varphi(\cdot, T)$ is concave and non-decreasing for all $T \in \mathbb{R}$, 2) $\varphi(c, \cdot)$ is convex and non-increasing for all $c \in \mathbb{R}$, and 3) $\varphi(c, \cdot)$ and $\varphi(\cdot, T)$ are differentiable for all $c, T \in (0, \infty)$.

We can now define an analog of the optimization problem (1) whose objective function belongs to $\mathcal{F}$:

$$\min_{T_1, \dots, T_K}\ \max_{i \in [K]}\ \varphi(c_i, T_i), \quad \text{s.t.} \quad \sum_{i=1}^{K} T_i = n, \tag{4}$$

where the parameters $(c_i)_{i=1}^K$ depend solely on the distance measure $D$ and the distributions $(P_i)_{i=1}^K$. Note that in this setting, the budget allocation reduces to balancing the value of the objective function by tracking the distribution-dependent parameters $(c_i)_{i=1}^K$ (to be estimated). We refer to the solution of (4) with full knowledge of $(c_i)_{i=1}^K$ as $(\tilde{T}_i^*)_{i=1}^K$, and to the corresponding allocation scheme as $\tilde{\mathcal{A}}^*$. Similar to (1), when the parameters $(c_i)_{i=1}^K$ are unknown, we refer to (4) as a general tracking problem.

Optimistic Tracking Algorithm. We now propose and analyze an adaptive sampling scheme, motivated by the upper-confidence bound (UCB) algorithm (Auer et al., 2002) in multi-armed bandits, for solving the general tracking problem (4). The proposed scheme, whose pseudo-code is shown in Algorithm 1, samples optimistically by plugging high-probability upper bounds of $c_i$ into the objective function $\varphi$. Formally, for each arm $i \in [K]$ and time $t \in [n]$, we denote by $T_{i,t}$ the number of times that arm $i$ has been pulled prior to time $t$. We define the $(1-\delta)$-probability (high-probability) event $E := \bigcap_{t \in [n]} \bigcap_{i \in [K]} \{|\hat{c}_{i,t} - c_i| \le e_{i,t}\}$, where $\hat{c}_{i,t}$ is the empirical estimate of $c_i$ and $e_{i,t}$ is the radius (half of the length) of its confidence interval at time $t$, computed using $T_{i,t}$ samples. We define the upper bound of $c_i$ at time $t$ as $u_{i,t} := \hat{c}_{i,t} + e_{i,t}$, with the convention that $u_{i,1} = +\infty$. In the rest of the paper, we use $\hat{P}_{i,t}$ and $\hat{p}_{ij,t}$ to represent the estimates of $P_i$ and $p_{ij}$ at time $t$, computed from $T_{i,t}$ i.i.d. samples.

Algorithm 1 Optimistic Tracking Algorithm
1: Input: $K$, $n$, $\delta$
2: Initialize $t \leftarrow 1$
3: while $t \le n$ do
4:   if $t \le K$ then
5:     $I_t = t$
6:   else
7:     $I_t = \arg\max_{i \in [K]} \varphi(u_{i,t}, T_{i,t})$
8:   end if
9:   Observe $X \sim P_{I_t}$; $t \leftarrow t + 1$
10:  Update $u_{i,t}$, $\forall i \in [K]$
11: end while

We now state a theorem that bounds the deviation of the allocation obtained by Algorithm 1 (our optimistic tracking algorithm), $(T_i)_{i=1}^K$, from the allocation $(\tilde{T}_i^*)_{i=1}^K$, i.e., the solution to (4) when the parameters $(c_i)_{i=1}^K$ are known. Before stating our main theorem, we define $g_i^* := \frac{\partial \varphi(c, \tilde{T}_i^*)}{\partial c}\big|_{c = c_i}$ and $h_i^* := -\frac{\partial \varphi(c_i, T)}{\partial T}\big|_{T = \tilde{T}_i^*}$.

Theorem 1. Define $A := \max_{i \in [K]} g_i^*$, $B := \max_{i \in [K]} h_i^*$, and $e_n := \max_{i \in [K]} e_i^*$, where $e_i^*$ is the radius of the confidence interval of arm $i$ after $\tilde{T}_i^*$ pulls. Then, under the event $E$, and assuming that $B > 0$ and $\tilde{T}_i^* > 1$ for all $i \in [K]$, we have
$$B\,\big|T_i - \tilde{T}_i^*\big| \le 2A(K-1)\,e_n, \quad \forall i \in [K].$$

The proof of Theorem 1, given in Appendix C, generalizes the arguments used in Carpentier et al. (2011, Thm. 1) to handle any objective function $\varphi \in \mathcal{F}$.
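The following is a minimal Python sketch of Algorithm 1, assuming the caller supplies the objective $\varphi$ from the class $\mathcal{F}$, a routine producing the upper confidence bounds $u_{i,t}$, and a sampler for each distribution. The distance-specific choices of $\varphi$ and the UCBs are the subject of Sec. 4; all names here are illustrative.

```python
import numpy as np

def optimistic_tracking(K, n, phi, ucb, sample):
    """Sketch of Algorithm 1 (optimistic tracking).

    phi(u, T)      : objective from the class F, evaluated at a UCB u and pull count T
    ucb(i, obs, t) : upper confidence bound u_{i,t} for c_i from the observations of arm i
    sample(i)      : draws one observation (a symbol index in [l]) from distribution P_i
    """
    T = np.zeros(K, dtype=int)          # T_{i,t}: pulls of each arm so far
    obs = [[] for _ in range(K)]        # raw observations per arm
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1                   # pull each arm once for initialization
        else:
            # optimism: plug the UCB of c_i into phi and pull the arm with the largest value
            i = int(np.argmax([phi(ucb(k, obs[k], t), T[k]) for k in range(K)]))
        obs[i].append(sample(i))
        T[i] += 1
    return T
```

For instance, for the $\ell_2^2$ case of Sec. 4.2 one would pass `phi = lambda u, T: u / max(T, 1)` together with the upper confidence bound of Lemma 2.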
The idea behind the preceding discussion is the following: in cases where the objective function $\gamma_i$ in (1) lies in $\mathcal{F}$, we can use the result of Theorem 1 to obtain a bound on the deviation of the allocation of the resulting Algorithm 1 from the oracle allocation. In other cases, we can select an appropriate approximation of $\gamma_i$ within the function class $\mathcal{F}$, and then use Theorem 1 in conjunction with a regret decomposition result (Lemma 1) to obtain the required regret bounds.

4. Adaptive Allocation Algorithms

Algorithm 1, along with the corresponding Theorem 1, provides us with a road-map to design adaptive sampling algorithms for the tracking problem (1) for different choices of the distribution distance $D$.

4.1. Road-map

We proceed in the following steps:
Step 1: If $\gamma_i := \mathbb{E}[D(\hat{P}_i, P_i)] \notin \mathcal{F}$ (Def. 1), then derive an approximation of $\gamma_i(\cdot)$, denoted by $\varphi(c_i, \cdot)$, lying in $\mathcal{F}$. If $\gamma_i \in \mathcal{F}$, then set $\varphi(c_i, \cdot) = \gamma_i(\cdot)$.
Step 2: Construct an appropriate UCB for the parameter $c_i$, for $i \in [K]$, to instantiate Algorithm 1, and use Theorem 1 to get a bound on the deviation of the allocation of Algorithm 1 from the optimal one.
Step 3: Derive an upper bound on the regret by employing the decomposition given in Lemma 1 below, along with some distance-specific analysis.

In the sequel, we shall refer to $(\tilde{T}_i^*)_{i=1}^K$, the optimal solutions to (4), as the approx-oracle allocation, and to the corresponding (non-adaptive) strategy, $\tilde{\mathcal{A}}^*$, as the approx-oracle allocation rule. We now present the key regret decomposition result that will be used in deriving the regret bounds for the cases where $\gamma_i \notin \mathcal{F}$ and an approximation $\varphi(c_i, \cdot)$ is used in Algorithm 1.

Lemma 1. For any allocation scheme $\mathcal{A}$, a distance measure $D$, and sampling budget $n$, define $\tilde{R}_n(\mathcal{A}, D) = L_n(\mathcal{A}, D) - L_n(\tilde{\mathcal{A}}^*, D)$ and $R_i(T) := |\gamma_i(T) - \varphi(c_i, T)|$ for any $T > 0$. Then, assuming $\gamma_i$ is non-increasing for all $i \in [K]$, we have
$$R_n(\mathcal{A}, D) \le \tilde{R}_n(\mathcal{A}, D) + 3\max_{i \in [K]} R_i(\tilde{T}_i^*).$$

This result says that if an approximate objective function $\varphi(c_i, \cdot)$ is used, then the regret $R_n(\mathcal{A}, D)$ of an allocation scheme $\mathcal{A}$ can be decomposed into its tracking regret $\tilde{R}_n(\mathcal{A}, D)$ and the maximum approximation error between $\gamma_i$ and $\varphi(c_i, \cdot)$ computed at $(\tilde{T}_i^*)_{i=1}^K$. Lemma 1 is proved in Appendix B. The key step in the proof is bounding the quantity $|\varphi(c_i, \tilde{T}_i^*) - \gamma_i(T_i^*)|$ by $2R_i(\tilde{T}_i^*)$.

4.2. Adaptive Allocation for $\ell_2^2$-Distance

The squared $\ell_2$-distance between two distributions $P$ and $Q$ is defined as $D_{\ell_2}(P, Q) := \sum_{j=1}^{l} (p_j - q_j)^2$. In this case, we can compute the objective function of (1) in closed form as
$$\gamma_i(T_i) = \mathbb{E}\big[D_{\ell_2}(\hat{P}_i, P_i)\big] = \mathbb{E}\Big[\sum_{j=1}^{l} (\hat{p}_{ij} - p_{ij})^2\Big] = \frac{\sum_{j=1}^{l} p_{ij}(1 - p_{ij})}{T_i} =: \frac{c_i^{(\ell_2)}}{T_i} =: \varphi(c_i^{(\ell_2)}, T_i).$$
Note that the function $\gamma_i(T_i) = \varphi(c_i^{(\ell_2)}, T_i) = c_i^{(\ell_2)}/T_i$ belongs to $\mathcal{F}$. The oracle allocation is obtained by equalizing $c_i^{(\ell_2)}/T_i$ for all $i \in [K]$, and can be written as $T_i^* = \tilde{T}_i^* = \big(c_i^{(\ell_2)} / \sum_{k=1}^{K} c_k^{(\ell_2)}\big)\, n =: \lambda_i^{(\ell_2)} n$. Next, we present a result on the deviation between $c_i^{(\ell_2)}$ and its empirical version $\hat{c}_{i,t}^{(\ell_2)} = 1 - \sum_{j=1}^{l} \hat{p}_{ij,t}^2$.

Lemma 2. Define $\delta_t := 6\delta/(Kl\pi^2 t^2)$, $e_{i,t}^{(\ell_2)} := \sqrt{(l+2)^2 \log(1/\delta_t)/(2T_{i,t})}$, and the event $E_1 = \bigcap_{t \in [n]} \bigcap_{i \in [K]} \{|c_i^{(\ell_2)} - \hat{c}_{i,t}^{(\ell_2)}| \le e_{i,t}^{(\ell_2)}\}$. Then we have $\mathbb{P}(E_1) \ge 1 - \delta$.

Using Lemma 2 (proved in Appendix D), we can define the required UCB for $c_i^{(\ell_2)}$ as $u_{i,t}^{(\ell_2)} := \hat{c}_{i,t}^{(\ell_2)} + e_{i,t}^{(\ell_2)}$, to be plugged into Algorithm 1 to obtain an adaptive sampling scheme for $D_{\ell_2}$, which we shall refer to as $\mathcal{A}_{\ell_2}$.
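As an illustration, the plug-in quantities for $\mathcal{A}_{\ell_2}$ could be computed as in the sketch below; the confidence radius follows the expression in Lemma 2 as stated above, so constants should be checked against the paper's appendix, and the function name is hypothetical.

```python
import numpy as np

def l2_plugin(obs_i, l, t, delta, K):
    """Empirical parameter and confidence radius for the l2^2 scheme (cf. Lemma 2).

    obs_i : list of observed symbol indices for arm i.
    Returns (c_hat, e); the UCB u_{i,t} = c_hat + e is used in Algorithm 1
    via phi(u, T) = u / T.
    """
    T_it = len(obs_i)
    if T_it == 0:
        return 0.0, np.inf                        # u_{i,1} = +infinity by convention
    p_hat = np.bincount(obs_i, minlength=l) / T_it
    c_hat = 1.0 - np.sum(p_hat ** 2)              # c_hat = 1 - sum_j p_hat_j^2
    delta_t = 6 * delta / (K * l * np.pi ** 2 * t ** 2)
    e = np.sqrt((l + 2) ** 2 * np.log(1 / delta_t) / (2 * T_it))
    return c_hat, e
```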
We can now state the bound on the regret incurred by the allocation scheme $\mathcal{A}_{\ell_2}$ (proof in Appendix D).

Theorem 2. If we implement the algorithm $\mathcal{A}_{\ell_2}$ with a budget $n$ and $\delta = n^{-5/2}$, then for $n$ large enough, and with $E_{\ell_2} := \max_{1 \le i \le K} |T_i - \tilde{T}_i^*|$, we have $E_{\ell_2} = O(\sqrt{n})$ and $R_n(\mathcal{A}_{\ell_2}, D_{\ell_2}) = O(n^{-3/2})$.

The precise meaning of the term "$n$ large enough", as well as the exact expressions for the hidden constants in the above bounds, are provided in Appendix D. Note that the $O(n^{-3/2})$ convergence rate of Thm. 2 recovers the rate derived in Carpentier et al. (2011) for the special case where the $(P_i)_{i=1}^K$ are Bernoulli. Finally, note that in contrast to the adaptive scheme, which achieves a regret of $O(n^{-3/2})$, employing the uniform sampling scheme results in a regret of $\max_i c_i^{(\ell_2)} K/n - \sum_{i=1}^{K} c_i^{(\ell_2)}/n = O(1/n)$.

4.3. Adaptive Allocation for $\ell_1$-Distance

The $\ell_1$-distance between two distributions $P$ and $Q$ is defined as $D_{\ell_1}(P, Q) := \sum_{j=1}^{l} |p_j - q_j|$. Note that the total-variation distance $D_{TV}$ is related to $D_{\ell_1}$ as $D_{TV} = \frac{1}{2} D_{\ell_1}$. In this case, the objective function $\gamma_i$ can be obtained in closed form using the expression for the mean absolute deviation $\mathbb{E}[|\hat{p}_{ij} - p_{ij}|]$ given in Diaconis & Zabell (1991, Eq. 1.1). However, since this expression does not belong to $\mathcal{F}$, we first obtain an approximation of $\gamma_i$ in $\mathcal{F}$ as

$$\gamma_i(T_i) = \mathbb{E}\big[D_{\ell_1}(\hat{P}_i, P_i)\big] = \mathbb{E}\Big[\sum_{j=1}^{l} |\hat{p}_{ij} - p_{ij}|\Big] \overset{(a)}{\le} \sum_{j=1}^{l} \sqrt{\mathbb{E}\big[(\hat{p}_{ij} - p_{ij})^2\big]} = \frac{1}{\sqrt{T_i}} \sum_{j=1}^{l} \sqrt{p_{ij}(1-p_{ij})} =: \frac{c_i^{(\ell_1)}}{\sqrt{T_i}} =: \varphi(c_i^{(\ell_1)}, T_i), \tag{5}$$

where (a) follows from Jensen's inequality and the concavity of the square-root function. We can check that the approximate objective function $\varphi(c_i^{(\ell_1)}, T_i) = c_i^{(\ell_1)}/\sqrt{T_i}$, with $c_i^{(\ell_1)} = \sum_{j=1}^{l} \sqrt{p_{ij}(1-p_{ij})}$, lies in $\mathcal{F}$. The approx-oracle allocation is given by $\tilde{T}_i^* = \big(c_i^{(\ell_1)}\big)^2/C_{\ell_1}^2\, n =: \lambda_i^{(\ell_1)} n$, where $C_{\ell_1}^2 = \sum_{i=1}^{K} (c_i^{(\ell_1)})^2$. In order to obtain the adaptive allocation scheme for the $\ell_1$-distance, which we shall refer to as $\mathcal{A}_{\ell_1}$, we now derive high-probability upper bounds on $(c_i^{(\ell_1)})_{i=1}^K$ and then plug them into Algorithm 1.

Lemma 3. Define $\delta_t := 3\delta/(Kl\pi^2 t^2)$, $e_{ij,t}^{(\ell_1)} := \sqrt{2\log(2/\delta_t)/T_{i,t}}$, and the event $E_2 := \bigcap_{t \in [n]} \bigcap_{i \in [K]} \bigcap_{j \in [l]} \big\{\big|\sqrt{\hat{p}_{ij,t}(1-\hat{p}_{ij,t})} - \sqrt{p_{ij}(1-p_{ij})}\big| \le e_{ij,t}^{(\ell_1)}\big\}$. Then, we have $\mathbb{P}(E_2) \ge 1 - \delta$.

The proof (details in Appendix E.1) relies on an application of a concentration inequality for the standard deviation of random variables derived in Maurer & Pontil (2009, Thm. 10), followed by two union bounds. Lemma 3 allows us to define high-probability upper bounds on the parameters $c_i^{(\ell_1)}$ as $u_{i,t}^{(\ell_1)} := \hat{c}_{i,t}^{(\ell_1)} + e_{i,t}^{(\ell_1)}$, where $e_{i,t}^{(\ell_1)} = \sum_{j=1}^{l} e_{ij,t}^{(\ell_1)} = \sqrt{2l^2 \log(2/\delta_t)/T_{i,t}}$. We now state the regret bound for the adaptive allocation scheme $\mathcal{A}_{\ell_1}$ (proof in Appendix E.2).

Theorem 3. If we implement the algorithm $\mathcal{A}_{\ell_1}$ with budget $n$ and $\delta = 1/n$, then for $n$ large enough, and with $E_{\ell_1} := \max_{1 \le i \le K} |T_i - \tilde{T}_i^*|$, we have $E_{\ell_1} = O(\sqrt{n})$ and $R_n(\mathcal{A}_{\ell_1}, D_{\ell_1}) = O(n^{-3/4})$.

The exact expressions for the hidden constants in the above bounds are derived in Appendix E.2. As a reference, we note that using the uniform allocation for the $\ell_1$ loss would result in a regret of $O(n^{-1/2})$, which is larger than the $O(n^{-3/4})$ regret achieved by the adaptive scheme.
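Analogously to the $\ell_2^2$ case, a sketch of the plug-in quantities for $\mathcal{A}_{\ell_1}$ is given below (illustrative names; the radius follows Lemma 3 as stated above).

```python
import numpy as np

def l1_plugin(obs_i, l, t, delta, K):
    """Empirical parameter and confidence radius for the l1 scheme (cf. Lemma 3).

    c_hat estimates c_i^{(l1)} = sum_j sqrt(p_ij (1 - p_ij)); the UCB
    u = c_hat + e is used in Algorithm 1 with phi(u, T) = u / sqrt(T).
    """
    T_it = len(obs_i)
    if T_it == 0:
        return 0.0, np.inf
    p_hat = np.bincount(obs_i, minlength=l) / T_it
    c_hat = np.sum(np.sqrt(p_hat * (1.0 - p_hat)))
    delta_t = 3 * delta / (K * l * np.pi ** 2 * t ** 2)
    e = np.sqrt(2 * l ** 2 * np.log(2 / delta_t) / T_it)   # e_{i,t} = sum_j e_{ij,t}
    return c_hat, e
```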
4.4. Adaptive Allocation for $f$-Divergence

For a convex function $f : \mathbb{R} \to \mathbb{R}$ satisfying $f(1) = 0$, the $f$-divergence between two distributions $P$ and $Q$ is defined as $D_f(P, Q) := \sum_{j=1}^{l} q_j f(p_j/q_j)$. Since we cannot obtain a closed-form expression for the objective function $\gamma_i$ of an $f$-divergence, we proceed by writing $D_f(\hat{P}_i, P_i) = D_f^{(r)}(\hat{P}_i, P_i) + R_{i,r+1}$, where $D_f^{(r)}(\hat{P}_i, P_i)$ is the $r$-term Taylor approximation of $D_f(\hat{P}_i, P_i)$, i.e.,

$$D_f^{(r)}(\hat{P}_i, P_i) := \sum_{j=1}^{l} \sum_{m=1}^{r} \frac{f^{(m)}(1)}{m!}\, \frac{(\hat{p}_{ij} - p_{ij})^m}{p_{ij}^{m-1}}, \tag{6}$$

and $R_{i,r+1} = \sum_{j=1}^{l} R_{ij,r+1}$ is its remainder term (assuming $f$ is analytic at 1), i.e.,

$$R_{ij,r+1} := \sum_{m=r+1}^{\infty} \frac{f^{(m)}(1)}{m!}\, \frac{(\hat{p}_{ij} - p_{ij})^m}{p_{ij}^{m-1}}. \tag{7}$$

Note that in (6) and (7), $f^{(m)}(\cdot)$ is the $m$th derivative of $f$. We now define the approximate objective function for an $f$-divergence as

$$\varphi(c_i, T_i) := \mathbb{E}\big[D_f^{(r)}(\hat{P}_i, P_i)\big] = \sum_{j=1}^{l} \sum_{m=1}^{r} \frac{f^{(m)}(1)}{m!}\, \frac{1}{p_{ij}^{m-1} T_i^m}\, \mathbb{E}\Big[\Big(\sum_{s=1}^{T_i} \bar{Z}_{ij}^{(s)}\Big)^m\Big]. \tag{8}$$

Note that the exact value of the parameter $c_i$ above depends on the values of the terms $f^{(m)}(1)$, for $1 \le m \le r$. Next, we present a general result on the quality of the approximation of $\gamma_i$ by $\varphi(c_i, T_i)$ under the following two assumptions on $f$: (f1) $f(x)$ is real-analytic at the point $x = 1$, and (f2) $f^{(m)}(1)/m! \le C_1 < \infty$ for all $m \in \mathbb{N}$. Both assumptions are satisfied by several commonly used $f$-divergences, namely the KL-divergence with $f(x) = x\log x$, the $\chi^2$-divergence with $f(x) = (x-1)^2$, and the Hellinger distance with $f(x) = 2(1 - \sqrt{x})$.

Lemma 4. Assume that $f$ satisfies (f1) and (f2). Then there exists a constant $C_{f,r+1} < \infty$, whose exact definition is given by Eqs. 23, 24, and 28 in Appendix F.2, such that the following holds: $\mathbb{E}[R_{ij,r+1}] \le C_{f,r+1}\,(p_{ij} T_i)^{-(r+1)/2}$.

To proceed further according to the road-map of Section 4.1, we need to construct an upper bound on the parameter $c_i$ that depends on the specific choice of the function $f$. We carry out these derivations for $f(x) = x\log x$ (i.e., the KL-divergence) in Section 4.4.1 below.

Remark 1. An alternative approach is to proceed by assuming that there exist $\tau_{0,i}$ and $\tau_{1,i}$ such that, with probability at least $1-\delta$, we have $\tau_{0,i} \le T_i \le \tau_{1,i}$ for $i \in [K]$ (analogous to the statement of Thm. 1). Under this assumption, we can obtain a very general regret decomposition for arbitrary $f$-divergence distance measures satisfying (f1) and (f2). The details of this approach are given in Appendix F.1; in particular, the formal statement of the regret decomposition is Thm. 7 in Appendix F.1.

4.4.1. Adaptive Allocation for KL-Divergence

The KL-divergence between distributions $P$ and $Q$ is defined as $D_{KL}(P, Q) := \sum_{j=1}^{l} p_j \log(p_j/q_j)$. We begin by deriving its $r$-term approximation with $r = 5$, i.e.,

$$\mathbb{E}\big[D_{KL}(\hat{P}_i, P_i)\big] = \mathbb{E}\Big[\sum_{j=1}^{l} \hat{p}_{ij}\log(\hat{p}_{ij}/p_{ij})\Big] \overset{(a)}{=} \frac{l-1}{2T_i} + \frac{1}{12T_i^2}\Big(\sum_{j=1}^{l} \frac{1}{p_{ij}} - 1\Big) + O(1/T_i^3), \tag{9}$$

where (a) follows by calculating the 5th-order Taylor approximation of the mapping $x \mapsto x\log(x)$. The calculations involved in this derivation are described in Harris (1975, Sec. 2). The choice of $r = 5$ is sufficient, as it is the smallest $r$ for which the approximation error, which is of $O(n^{-3})$ according to Lemma 4, is smaller than the tracking regret, which, as we show in the proof of Thm. 4, is of $O(n^{-5/2})$. Eq. (9) gives us the approximate objective function $\varphi(c_i^{(KL)}, T_i) := \frac{l-1}{2T_i} + \frac{c_i^{(KL)}}{T_i^2}$, with $c_i^{(KL)} := \big(\sum_{j=1}^{l} 1/p_{ij} - 1\big)/12$. Note that this $\varphi(c_i^{(KL)}, T_i)$ belongs to the class of functions $\mathcal{F}$ introduced in Definition 1. Deriving the approx-oracle allocation $(\tilde{T}_i^*)_{i=1}^K$ requires solving a cubic equation. Instead of computing the exact form of $\tilde{T}_i^*$, we show in Lemma 5 that the deviation of $\tilde{T}_i^*$ from the uniform allocation is bounded by a problem-dependent constant, implying that the uniform allocation is near-optimal. This is not surprising, as the first-order approximation of $\varphi$ (the first term on the RHS of Eq. 9) does not change with $P_i$.

Lemma 5. For $(P_i)_{i=1}^K \in \Delta_l^{(\eta)}$ and $(\tilde{T}_i^*)_{i=1}^K$ denoting the approx-oracle allocation, we have
$$\big|\tilde{T}_i^* - T_0\big| \le \frac{K\big(c_{\max}^{(KL)} - c_{\min}^{(KL)}\big)}{l-1}, \quad \forall i \in [K],$$
where $T_0 = n/K$ is the uniform allocation, and $c_{\min}^{(KL)}$ and $c_{\max}^{(KL)}$ denote the minimum and maximum values of $c_i^{(KL)}$, respectively.
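As a quick numerical illustration of Lemma 5, the following sketch (illustrative names, based on the approximate objective in (9)) equalizes $\varphi(c_i^{(KL)}, T_i)$ across arms by bisection on the common objective value and shows that the resulting allocation stays close to the uniform one.

```python
import numpy as np

def c_kl(p):
    """c_i^{(KL)} = (sum_j 1/p_ij - 1) / 12."""
    return (np.sum(1.0 / p) - 1.0) / 12.0

def approx_oracle_allocation(cs, n, l, iters=60):
    """Equalize phi(c_i, T_i) = (l-1)/(2 T_i) + c_i / T_i^2 over arms, s.t. sum_i T_i = n."""
    def T_of(mu):
        # solve mu*T^2 - ((l-1)/2)*T - c = 0 for T > 0, separately for each arm
        a, b = mu, -(l - 1) / 2.0
        return (-b + np.sqrt(b ** 2 + 4 * a * cs)) / (2 * a)
    lo, hi = 1e-12, 1.0
    while np.sum(T_of(hi)) > n:          # raise the common value until the budget is not exceeded
        hi *= 2
    for _ in range(iters):               # bisection on the common objective value
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if np.sum(T_of(mid)) < n else (mid, hi)
    return T_of(0.5 * (lo + hi))

if __name__ == "__main__":
    P = [np.array([0.4, 0.3, 0.3]), np.array([0.9, 0.05, 0.05])]
    cs = np.array([c_kl(p) for p in P])
    T = approx_oracle_allocation(cs, n=1000, l=3)
    print("approx-oracle:", T, " vs uniform:", [500, 500])   # deviation stays O(1) in n (Lemma 5)
```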
Next, with $e_{ij,t} := \sqrt{2\log(2/\delta_t)/T_{i,t}}$, we define the following upper bound for the parameters $c_i^{(KL)}$, $i \in [K]$:
$$u_{i,t}^{(KL)} = \begin{cases} \dfrac{1}{12}\Big(\sum_{j=1}^{l} \dfrac{1}{\hat{p}_{ij,t} - e_{ij,t}} - 1\Big) & \text{if } \hat{p}_{ij,t} \ge \dfrac{7e_{ij,t}}{2} \text{ for all } j \in [l], \\ +\infty & \text{otherwise.} \end{cases}$$
The deviation of this upper bound from the true parameter $c_i^{(KL)}$ can be computed by exploiting the convexity of the mapping $x \mapsto 1/x$ and the exact expression for the length of the confidence interval reported in Lemma 10 in Appendix G. These upper bounds can then be plugged into Algorithm 1 to obtain an adaptive allocation scheme for the KL-divergence, denoted by $\mathcal{A}_{KL}$. Finally, we state the regret bound for $\mathcal{A}_{KL}$ in the following theorem (proof in Appendix G).

Theorem 4. Let $(P_i)_{i=1}^K \in \Delta_l^{(\eta)}$ and suppose the adaptive scheme $\mathcal{A}_{KL}$ is implemented with $\delta = (3K/n)^6$. Then, for large enough $n$ and with $E_{KL} := \max_{i \in [K]} |T_i - \tilde{T}_i^*|$, we have $E_{KL} = O(n^{1/2})$ and $R_n(\mathcal{A}_{KL}, D_{KL}) = O(n^{-5/2})$.

As we showed in Lemma 5, the approx-oracle allocation for $D_{KL}$ is close, although not identical, to the uniform allocation. This is due to the fact that the first-order term in the approximation given in (9) depends only on the support size of the distributions, which is assumed to be the same for all $K$ distributions in our setting. Thus, the uniform allocation is the approx-oracle allocation for the first-order approximation of $D_{KL}$, and it achieves a regret upper bound of $O(n^{-2})$. However, as shown in Theorem 4 above, consideration of the higher-order terms allows us to achieve an $O(n^{-5/2})$ regret (see Eq. 45 in Appendix G for the exact expression).

4.5. Adaptive Allocation for Separation Distance

The separation distance (Gibbs & Su, 2002) between distributions $P$ and $Q$ is defined as $D_s(P, Q) := \max_{j \in [l]} (1 - p_j/q_j)$. We start by introducing some new notation. Given a probability distribution $P_i \in \Delta_l$ and a non-empty set $S \subset [l]$, we define $p_{i,S} := \sum_{j \in S} p_{ij}$. We also define the functions $\rho_1(p) := \sqrt{(1-p)/p}$ and $\rho_2(p) := \rho_1(p) + \rho_1(1-p)$, and introduce the terms $c_i^{(s)} := \sum_{j=1}^{l} \rho_1(p_{ij})$ and $\bar{c}_i^{(s)} := \max_{\emptyset \ne S \subset [l]} \rho_2(p_{i,S})$. Note that $c_i^{(s)} = \bar{c}_i^{(s)}$ for $l = 2$. Because of the max operation in the definition of $D_s$, in general we cannot obtain a closed-form expression for the objective function $\gamma_i(T_i) = \mathbb{E}[D_s(\hat{P}_i, P_i)]$. We now state a key lemma (proof in Appendix H) that provides an approximation of $\mathbb{E}[D_s(\hat{P}_i, P_i)]$.

Lemma 6. For a distribution $P_i \in \Delta_l$, let $\hat{P}_i = (\hat{p}_{ij})_{j=1}^l$ be the empirical distribution constructed from $T_i$ i.i.d. draws from $P_i$. Then, we have
$$\bar{c}_i^{(s)}\,\frac{1}{\sqrt{2\pi T_i}} - \frac{C_i^{(s)}}{T_i} \;\le\; \mathbb{E}\big[D_s(\hat{P}_i, P_i)\big] \;\le\; c_i^{(s)}\,\frac{1}{\sqrt{2\pi T_i}} + \frac{\bar{C}_i^{(s)}}{T_i},$$
where $C_i^{(s)}$ and $\bar{C}_i^{(s)}$ are $P_i$-dependent constants defined by (47) and (50) in Appendix H.

The proof of the upper bound in Lemma 6 proceeds by first upper bounding the term inside the expectation by a normalized sum of random variables, and then using a non-asymptotic version of the Central Limit Theorem (Ross, 2011, Thm. 3.2). For the lower bound, we first show (Lemma 12 in Appendix H) that bunching together probability masses can only reduce the separation distance between two distributions, and then proceed by another application of (Ross, 2011, Thm. 3.2). Lemma 6 gives us an interval that contains the true objective function we aim to track. To implement the adaptive scheme, we employ the approximate objective function $\varphi(c_i^{(s)}, T_i) := c_i^{(s)}\sqrt{1/(2\pi T_i)}$.
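The following is a small sketch of the two constants appearing in Lemma 6, under our reading that $c_i^{(s)}$ governs the upper bound and $\bar{c}_i^{(s)}$ the lower bound; for $l = 2$ the two coincide, as noted above (names illustrative).

```python
import itertools
import numpy as np

def rho1(p):
    """rho_1(p) = sqrt((1 - p) / p)."""
    return np.sqrt((1.0 - p) / p)

def rho2(p):
    """rho_2(p) = rho_1(p) + rho_1(1 - p)."""
    return rho1(p) + rho1(1.0 - p)

def c_sep(p):
    """c_i^{(s)} = sum_j rho_1(p_ij)."""
    return np.sum(rho1(p))

def c_sep_bar(p):
    """max over non-empty proper subsets S of [l] of rho_2(p_{i,S})."""
    l = len(p)
    best = 0.0
    for r in range(1, l):
        for S in itertools.combinations(range(l), r):
            best = max(best, rho2(np.sum(p[list(S)])))
    return best

if __name__ == "__main__":
    p2 = np.array([0.7, 0.3])          # l = 2: the two constants coincide
    p3 = np.array([0.5, 0.3, 0.2])     # l = 3: the two constants generally differ
    print(c_sep(p2), c_sep_bar(p2))
    print(c_sep(p3), c_sep_bar(p3))
```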
In order to instantiate Algorithm 1 for $D_s$, we need to derive high-probability confidence intervals for the terms $\sqrt{(1-p_{ij})/p_{ij}}$ appearing in the definition of $(c_i^{(s)})_{i=1}^K$. We use the event $E_1$ defined in Lemma 2 and prove the following result.

Lemma 7. Let $P_i \in \Delta_l^{(\eta)}$, and let the event $E_1$ and the terms $\delta_t$ and $e_{ij,t}$ be defined as in Lemma 2. Define the terms $a_{i,t} := \big(8\log(2/\delta_t)/T_{i,t}\big)^{1/4}$ and $b_{i,t} := \frac{l\,a_{i,t}}{\eta}\max\big\{1, \frac{a_{i,t}}{2\eta^{3/2}}\big\}$. Then, under the high-probability event $E_1$, we have
$$\sqrt{\frac{1}{\hat{p}_{ij,t} - e_{ij,t}} - 1} \;\le\; \sqrt{\frac{1}{p_{ij}} - 1} + b_{i,t}.$$

Using the concentration result of Lemma 7, we can now implement Algorithm 1 with the upper bound $u_{i,t}^{(s)} = \sum_{j=1}^{l} \sqrt{\frac{1}{\hat{p}_{ij,t} - e_{ij,t}} - 1}\,\big/\sqrt{2\pi}$ if $\hat{p}_{ij,t} \ge 7e_{ij,t}/2$ for all $j \in [l]$, and $u_{i,t}^{(s)} = +\infty$ otherwise. This gives us an adaptive allocation scheme for the separation distance, which we shall refer to as $\mathcal{A}_s$. Finally, we prove the following regret bound for $\mathcal{A}_s$ (proof in Appendix H.3).

Theorem 5. Let $P_i \in \Delta_l^{(\eta)}$ and suppose the adaptive scheme $\mathcal{A}_s$ is implemented with $\delta = \eta/n$. Then, for large enough $n$ and with $E_s := \max_{1 \le i \le K} |T_i - \tilde{T}_i^*|$, we have $E_s = O(\sqrt{n})$ and
$$R_n(\mathcal{A}_s, D_s) = O\Big(\max_{i \in [K]} \frac{c_i^{(s)} - \bar{c}_i^{(s)}}{\sqrt{n}} + \frac{1}{n^{3/4}}\Big). \tag{10}$$
The exact condition on $n$, and the expressions for $E_s$ and the higher-order terms in (10), are given in Appendix H.3.

Remark 2. Note that the first term on the RHS of (10) is the approximation-error term in the regret decomposition introduced in Lemma 1, while the second term is the tracking regret w.r.t. the approx-oracle allocation scheme. In general, the approximation error, which is $O(n^{-1/2})$, dominates the tracking-regret term, which is $O(n^{-3/4})$. However, for the special case of $l = 2$, the approximation-error term becomes $O(1/n)$, using the fact that $c_i^{(s)} = \bar{c}_i^{(s)}$ in Lemma 6, and we achieve an overall regret of $O(n^{-3/4})$.

5. Lower Bound

Theorem 1 provided a general high-probability bound on the deviation of the adaptive allocation $(T_i)_{i=1}^K$ from the approx-oracle allocation $(\tilde{T}_i^*)_{i=1}^K$. In Sec. 4, we observed that when specialized to the objective functions corresponding to $D_{\ell_2}$, $D_{\ell_1}$, and $D_s$, we have $|T_i - \tilde{T}_i^*| = O(\sqrt{n})$. A natural question to ask is whether there exists any other adaptive scheme that can achieve a smaller deviation from the approx-oracle allocation. We now show that this is not the case by deriving a lower bound on the expected deviation of any allocation scheme $\mathcal{A}$. To derive the lower bound, we consider a specific class of problems with two arms, $K = 2$, Bernoulli distributions, $l = 2$, and objective functions of the form $\varphi(c_i, T_i) = c_i/T_i^{\alpha}$, for some $\alpha > 0$. For some $p_0 \in (1/2, 1)$ and $\epsilon > 0$, we define two Bernoulli distributions $P_1 = \mathrm{Ber}(p_0)$ and $P_2 = \mathrm{Ber}(p_0 - \epsilon)$. We consider two problem instances $\mathcal{P}_1$ and $\mathcal{P}_2$ with $K = 2$ and distributions $P_1$ and $P_2$, but with their order swapped, i.e., $\mathcal{P}_1 = (P_1, P_2)$ and $\mathcal{P}_2 = (P_2, P_1)$. Finally, we introduce the notation $\kappa(p)$ to represent the distribution-dependent constant in the objective function $\varphi$ corresponding to a $\mathrm{Ber}(p)$ distribution. We now state the lower bound result.

Theorem 6. For some $p_0 \in (1/2, 3/4]$ and $0 < \epsilon < p_0 - 1/2$, consider the two tracking problems $\mathcal{P}_1 = (P_1, P_2)$ and $\mathcal{P}_2 = (P_2, P_1)$, with $P_1 = \mathrm{Ber}(p_0)$ and $P_2 = \mathrm{Ber}(p_0 - \epsilon)$, and objective function $\varphi(c, T) = c/T^{\alpha}$ for $\alpha > 0$, where the constant $c = \kappa(p)$ for a $\mathrm{Ber}(p)$ distribution. Finally, introduce the notation $\tau = \frac{n}{2}\cdot\frac{\big|\kappa(p_0)^{1/\alpha} - \kappa(p_0-\epsilon)^{1/\alpha}\big|}{\kappa(p_0)^{1/\alpha} + \kappa(p_0-\epsilon)^{1/\alpha}}$. If $(T_i)_{i=1}^2$ denotes the allocation of any allocation scheme $\mathcal{A}$, we have
$$\max_{\mathcal{P}_1, \mathcal{P}_2}\ \max_{i=1,2}\ \mathbb{E}\big[|T_i - \tilde{T}_i^*|\big] \;\ge\; \sup_{0<\epsilon}$$