# Sequential Mode Estimation with Oracle Queries

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Dhruti Shah,1 Tuhinangshu Choudhury,1 Nikhil Karamchandani,1 Aditya Gopalan2
1Indian Institute of Technology Bombay, 2Indian Institute of Science, Bangalore
{dhruti96shah, choudhurytuhinangshu}@gmail.com, nikhilk@ee.iitb.ac.in, aditya@iisc.ac.in

Abstract. We consider the problem of adaptively PAC-learning a probability distribution P's mode by querying an oracle for information about a sequence of i.i.d. samples $X_1, X_2, \ldots$ generated from P. We consider two query models: (a) each query is an index $i$, for which the oracle reveals the value of the sample $X_i$; (b) each query consists of two indices $i$ and $j$, for which the oracle reveals whether the samples $X_i$ and $X_j$ are equal or not. For these query models, we give sequential mode-estimation algorithms which, at each time $t$, either make a query to the corresponding oracle based on past observations, or decide to stop and output an estimate of the distribution's mode, required to be correct with a specified confidence. We analyze the query complexity of these algorithms for any underlying distribution P, and derive corresponding lower bounds on the optimal query complexity under the two querying models.

1 Introduction

Estimating the most likely outcome of a probability distribution is a useful primitive in many computing applications such as counting, natural language processing, and clustering. We study the probably approximately correct (PAC) sequential version of this problem, in which the learner faces a stream of elements sampled independently and identically distributed (i.i.d.) from an unknown probability distribution, and wishes to learn, with high confidence, an element with the highest probability mass (a mode).
At any time, the learner can issue queries to obtain information about the identities of samples in the stream, and aims to use as few queries as possible to learn the distribution's mode with high confidence. Specifically, we consider two natural models for sample identity queries: (a) each query, for a single sample of the stream so far, unambiguously reveals the identity (label) of that sample; (b) each query, for a pair of samples in the stream, reveals whether they are the same element or not.

A concrete application of mode estimation (and one of the main motivations for this formulation) is adaptive, partial clustering, where the objective is to find the largest cluster (i.e., equivalence class) of elements, as opposed to learning the entire cluster grouping (Mazumdar and Saha 2017a; 2017b; 2017c; 2016). We are given a set of elements with an unknown clustering or partition, and would like to find the elements comprising the largest cluster. Suppose a stream of elements is sampled uniformly and independently from the set, and at each time one can ask a comparison oracle questions of the form: do two sampled elements u and v belong to the same cluster or not? Under this uniform sampling distribution, the probability of an element belonging to a given cluster is proportional to the cluster's size, so learning the heaviest cluster is equivalent to identifying the mode of the distribution of a sampled element's cluster label.

(N. Karamchandani's research was supported in part by Indo-French grant No. IFC/DST-Inria-2016-01/448, "Machine Learning for Network Analytics". Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)

We make the following contributions towards understanding the sequential query complexity of estimating the mode of a distribution using a stream of samples.
(a) For both the individual-label and pairwise-similarity query models, we give sequential PAC query algorithms that provably output a mode of the sample-generating distribution with large probability, together with guarantees on the number of queries they issue. These query complexity upper bounds depend explicitly on parameters of the unknown discrete probability distribution: they scale inversely with the gap between the probability masses at the mode and at the other elements of the distribution's support. The proposed algorithms exploit the probabilistic i.i.d. structure of the data stream to resolve uncertainty about the mode in a query-efficient fashion, and are based on the upper and lower confidence bound (UCB, LCB) principle from online learning to guide adaptive exploration over time; in fact, we employ the more refined empirical Bernstein bounds (Maurer and Pontil 2009) to take better advantage of the exact structure of the unknown sample distribution.

(b) We derive fundamental limits on the query complexity of any sequential mode-finding algorithm for both query models, whose constituents resemble those of the query complexity upper bounds for our specific algorithms above. This indicates that the proposed algorithms make good use of their queries and the associated information in converging upon a mode estimate.

(c) We report numerical simulation results that support our theoretical query complexity bounds.

1.1 Related Work

The mode estimation problem has been studied in the batch or non-sequential setting for many decades, dating back to the work of Parzen (Parzen 1962) and Chernoff (Chernoff 1964), among others. This line of work, however, focuses on consistent mode estimation (and the asymptotic distribution of the estimate) for continuous distributions, rather than the finite-time PAC guarantees for large-support discrete distributions considered here.
Our problem is essentially a version of sequential composite hypothesis testing with adaptive actions or queries, and with an explicit high-confidence requirement on the testing algorithm upon stopping. There has been a significant amount of work in the streaming algorithms community on the heavy-hitter problem, i.e., detecting the most frequent symbol in an arbitrary (non-stochastic) sequence, and generalizations thereof to estimating empirical moments; see, e.g., (Misra and Gries 1982), (Karp, Shenker, and Papadimitriou 2003), (Manku and Motwani 2002). However, the focus there is on understanding resource limits, such as memory and computational effort, for computing over arbitrary (non-stochastic, unstructured) streams that arise in highly dynamic network applications. We are instead interested in quantifying the statistical efficiency of mode estimation algorithms in terms of the structure of the generating probability distribution.

Adaptive decision making and the resolution of the explore-exploit trade-off is the subject of work on the well-known multi-armed bandit model, e.g., (Bubeck, Cesa-Bianchi, and others 2012). At an abstract level, our PAC mode estimation problem resembles a multi-armed bandit best-arm identification problem (Kaufmann, Cappé, and Garivier 2016), but with a different information structure: queries are not directly tied to any utility or reward structure as in bandits. Perhaps the closest work in spirit to ours is the recent work by Mazumdar and co-authors (Mazumdar and Saha 2017a; 2017b; 2017c; 2016), where the aim is to learn the entire structure of an unknown clustering by making information queries. In this regard, studying the mode estimation problem sheds light on the simpler, and often more natural, objective of merely identifying the largest cluster in many machine learning applications, which has not been addressed by previous work.
2 Problem formulation

In this section we develop the required notation and describe the query models. Consider an underlying unknown discrete probability distribution $P$ with support set $\{1, 2, \ldots, k\}$. For each $i \in \{1, 2, \ldots, k\}$ and a random variable $X \sim P$, let $\Pr(X = i) = p_i(P) \equiv p_i$. We would like to estimate the mode of the unknown distribution $P$, defined as any member of the set $\arg\max_{1 \le i \le k} p_i$ (here $\arg\max_{i \in S} p_i$ denotes the set of all maximisers of the function $i \mapsto p_i$ on $S$). Towards this, we assume query access to an oracle containing a sequence of independently and identically distributed (i.i.d.) samples from $P$, denoted $X_1, X_2, \ldots$. We study the mode estimation problem under the following query models for accessing the values of these i.i.d. samples:

1. Query Model 1 (QM1): Each query specifies an index $i \ge 1$, following which the oracle reveals the value of the sample $X_i$. Since the samples are i.i.d., without loss of generality we assume that the $t$-th successive query reveals $X_t$, for $t = 1, 2, \ldots$

2. Query Model 2 (QM2): In this model, the oracle answers pairwise similarity queries. Each query specifies two indices $i, j \in \{1, 2, \ldots\}$, following which the oracle reveals whether the two samples $X_i$ and $X_j$ are equal or not. Formally, the response of the oracle to a query $(i, j)$ is
$$O(i, j) = \begin{cases} +1 & \text{if } X_i = X_j, \\ -1 & \text{otherwise.} \end{cases}$$
Note that in this query model, multiple pairwise queries to the oracle might be required to determine the value of a sample $X_i$.

For each of the query models above, our goal is to design a statistically efficient sequential mode-estimation algorithm which, at each time $t$, either makes a query to the oracle based on past observations or decides to stop and output an estimate of the distribution's mode. Mathematically, a sequential algorithm with a stopping rule decides an action $A_t$ at each time $t \ge 1$, depending only on past observations. For QM1, $A_t$ can be one of the following:

(continue, $t$): query the index $t$;
(stop, $\hat m$), $\hat m \in \{1, \ldots, k\}$: stop querying and return $\hat m$ as the mode estimate.

For QM2, $A_t$ can be one of the following:

(continue, $t$): continue with the next round, possibly making multiple sequential pairwise queries of the form $(t, j)$ for some $j < t$; that is, we compare the sample $X_t$ with some or all of the previous samples;
(stop, $\hat m$), $\hat m \in \{1, \ldots, k\}$: stop querying and return $\hat m$ as the mode estimate.

The stopping time of the algorithm is defined as $\tau := \inf\{t \ge 1 : A_t = (\text{stop}, \cdot)\}$. The cost of the algorithm is measured by its query complexity, i.e., the number of queries made before stopping. For $\delta > 0$, a sequential mode-estimation algorithm is defined to be a $\delta$-true mode estimator if it correctly identifies the mode for every distribution $P$ on the support set $\{1, 2, \ldots, k\}$ with probability at least $1 - \delta$, i.e., $\mathbb{P}_P[\hat m \in \arg\max_{1 \le i \le k} p_i(P)] \ge 1 - \delta$. The goal is to obtain $\delta$-true mode estimators for each query model (QM1 and QM2) that require as few queries as possible. For a $\delta$-true mode estimator $A$ and a distribution $P$, let $Q^P_\delta(A)$ denote the number of queries made by $A$ when the underlying unknown distribution is $P$. We are interested in studying the optimal query complexity of $\delta$-true mode estimators. Note that $Q^P_\delta(A)$ is itself a random quantity, and our results hold either in expectation or with high probability. For the purpose of this paper, we assume that $p_1 > p_2 \ge \cdots \ge p_k$, i.e., the mode of the underlying distribution is 1, and hence a $\delta$-true mode estimator returns 1 with probability at least $1 - \delta$.

In Sections 3 and 4, we present $\delta$-true mode estimators and analyze their query complexity for the QM1 and QM2 query models, respectively. We provide experimental results in Section 5 and explore a few variations of the problem in Section 6. Several proofs have been relegated to the full version of our paper (Shah et al. 2019).

3 Mode estimation with QM1

We begin by presenting an algorithm for mode estimation under QM1 and analyzing its query complexity.
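Before describing the algorithm, the two query models of Section 2 can be sketched as a small oracle interface. This is an illustrative reading of QM1 and QM2; the class and method names are ours, not the paper's.

```python
import random

class SampleOracle:
    """Oracle holding an i.i.d. sample stream from a discrete distribution P.

    QM1: value_query(i) reveals the value of sample X_i.
    QM2: pair_query(i, j) reveals only whether X_i == X_j (+1 / -1).
    Illustrative sketch; names are ours, not from the paper.
    """

    def __init__(self, p, seed=0):
        self.p = p                      # p[v] = probability of support element v
        self.rng = random.Random(seed)
        self.samples = []               # lazily drawn i.i.d. samples X_1, X_2, ...

    def _ensure(self, i):
        # Draw samples until X_i exists (inverse-CDF sampling).
        while len(self.samples) < i:
            u, acc = self.rng.random(), 0.0
            for v, pv in enumerate(self.p):
                acc += pv
                if u < acc:
                    self.samples.append(v)
                    break
            else:
                self.samples.append(len(self.p) - 1)

    def value_query(self, i):
        """QM1 query: reveal the value of X_i (index i >= 1)."""
        self._ensure(i)
        return self.samples[i - 1]

    def pair_query(self, i, j):
        """QM2 query: O(i, j) = +1 if X_i == X_j, else -1."""
        self._ensure(max(i, j))
        return +1 if self.samples[i - 1] == self.samples[j - 1] else -1
```

A QM2 learner only ever sees the +1/-1 answers, so it must reconstruct value groups (bins) indirectly, which is why a single sample can cost several pairwise queries.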
3.1 Algorithm

Recall that under the QM1 query model, querying the index $t$ reveals the value of the corresponding sample $X_t$, generated i.i.d. according to the underlying unknown distribution $P$. During the course of the algorithm, we form bins for the elements of the support $\{1, 2, \ldots, k\}$: bin $j$ is created when the first sample with value $j$ is revealed, and any further samples with that value are placed in the same bin. For each query $t$ with revealed sample value $X_t$, define $Z^i_t$ for $i \in \{1, 2, \ldots, k\}$ as
$$Z^i_t = \begin{cases} 1 & \text{if } X_t = i, \\ 0 & \text{otherwise.} \end{cases}$$
Note that for each given $i, t$, $Z^i_t$ is a Bernoulli random variable with $\mathbb{E}[Z^i_t] = p_i$; also, for any given $i$, the $\{Z^i_t\}$ are i.i.d. over time.

Our mode estimation scheme is presented in Algorithm 1. At each stage of the algorithm, we maintain an empirical estimate of the probability $p_i$ of bin $i$, for each $i \in \{1, 2, \ldots, k\}$. Let $\hat p^t_i$ denote the estimate at time $t$, given by
$$\hat p^t_i = \frac{1}{t} \sum_{j=1}^{t} Z^i_j, \qquad (1)$$
where $Z^i_1, \ldots, Z^i_t$ are the $t$ i.i.d. samples. At each time instant, we also maintain confidence bounds for the estimate of $p_i$. The confidence interval radius for the $i$-th bin probability at the $t$-th iteration is denoted by $\beta^t_i$; it captures the deviation of the empirical value $\hat p^t_i$ from the true value $p_i$. In particular, $\beta^t_i$ is chosen so that the true value $p_i$ lies in the interval $[\hat p^t_i - \beta^t_i, \hat p^t_i + \beta^t_i]$ with significantly large probability. The lower and upper boundaries of this interval are referred to as the lower confidence bound (LCB) and the upper confidence bound (UCB), respectively. The particular choice of $\beta^t_i$ used by our algorithm is presented in Section 3.2. Finally, our stopping rule is as follows: $A_t = (\text{stop}, i)$ when there exists a bin $i \in \{1, 2, \ldots, k\}$ whose LCB is greater than the UCB of every other bin, upon which the index $i$ is output as the mode estimate.
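The update-and-stop loop just described can be sketched as below, using the empirical-Bernstein radius that Section 3.2 instantiates in eq. (2). Function names are ours; for a {0,1} sequence, the sample variance $V_t(Z^i)$ simplifies to $\frac{t}{t-1}\hat p(1-\hat p)$, which the sketch uses directly.

```python
import math
import random

def beta_bernstein(phat, t, k, delta):
    """Empirical Bernstein radius of eq. (2); for {0,1} observations the
    sample variance V_t reduces to t/(t-1) * phat * (1 - phat)."""
    L = math.log(4 * k * t * t / delta)
    vt = t / (t - 1) * phat * (1 - phat)
    return math.sqrt(2 * vt * L / t) + 7 * L / (3 * (t - 1))

def algorithm1(p, delta=0.05, seed=1):
    """Sketch of Algorithm 1 under QM1 (names are ours): reveal X_t,
    update bin counts and radii, and stop once some bin's LCB exceeds
    every other bin's UCB."""
    rng = random.Random(seed)
    k = len(p)
    counts = [0] * k
    t = 0
    while True:
        t += 1
        x = rng.choices(range(k), weights=p)[0]   # QM1 oracle reveals X_t
        counts[x] += 1
        if t < 2:
            continue                              # radius needs t >= 2
        est = [c / t for c in counts]
        rad = [beta_bernstein(e, t, k, delta) for e in est]
        lead = max(range(k), key=lambda i: est[i] - rad[i])
        if all(est[lead] - rad[lead] > est[j] + rad[j]
               for j in range(k) if j != lead):
            return lead, t                        # (mode estimate, queries used)
```

On a toy distribution such as p = [0.6, 0.2, 0.2], the loop stops after a few hundred queries, with the stopping time growing as the gap p1 - p2 shrinks, in line with Theorem 2.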
Given the way our confidence intervals are defined, this ensures that the output of the estimator is the mode of the underlying distribution with large probability.

Algorithm 1: Mode estimation algorithm under QM1
1: $t = 1$
2: $A_0$ = (continue, 1): obtain $X_1$.
3: loop
4: if a bin with value $X_t$ already exists then
5: Add query index $t$ to the corresponding bin.
6: else
7: Create a new bin with value $X_t$.
8: end if
9: Update the empirical estimate $\hat p^t_i$ (1) and confidence interval $\beta^t_i$ (2) for all bins $i \in \{1, 2, \ldots, k\}$.
10: if $\exists\, i$ s.t. $\hat p^t_i - \beta^t_i > \hat p^t_j + \beta^t_j$ for all $j \ne i$: $A_t$ = (stop, $i$), exit and output $i$; otherwise $A_t$ = (continue, $t+1$), obtain $X_{t+1}$.
11: $t = t + 1$
12: end loop

3.2 Analysis

Theorem 1. For the following choice of $\beta^t_i$, Algorithm 1 is a $\delta$-true mode estimator:
$$\beta^t_i = \sqrt{\frac{2 V_t(Z^i) \log(4kt^2/\delta)}{t}} + \frac{7 \log(4kt^2/\delta)}{3(t-1)}, \qquad (2)$$
where $V_t(Z^i) = \frac{1}{t(t-1)} \sum_{1 \le j < l \le t} (Z^i_j - Z^i_l)^2$ is the sample variance of $Z^i_1, \ldots, Z^i_t$.

Proof (sketch). Let $E_1$ denote the error event that for some pair $(i, t)$ the confidence bound (3) around $\hat p^t_i$ is violated, and $E_2$ the error event that for some pair $(i, t)$ the confidence bound (4) around the sample variance $V_t(Z^i)$ is violated. Suitable choices of the failure probabilities $\delta_1, \delta_2$ give $P[E^c_1 \cap E^c_2] \ge 1 - \delta$ (5). If the algorithm stopped and returned some $i \ne 1$ while $E^c_1 \cap E^c_2$ holds, we would have $p_i \ge \hat p^t_i - \beta^t_i > \hat p^t_1 + \beta^t_1 \ge p_1$, which is false. Thus Algorithm 1 returns 1 as the mode whenever the event $E^c_1 \cap E^c_2$ is true. Hence, the probability of returning 1 as the mode is at least $P[E^c_1 \cap E^c_2]$, which by (5) implies that Algorithm 1 is a $\delta$-true mode estimator.

3.3 Query Complexity Upper Bound

Theorem 2. For the $\delta$-true mode estimator $A_1$ corresponding to Algorithm 1, we have the following with probability at least $1 - \delta$:
$$Q^P_\delta(A_1) \le \frac{592}{3} \frac{p_1}{(p_1 - p_2)^2} \log\left(\frac{k}{\delta} \frac{p_1}{(p_1 - p_2)^2}\right).$$

Proof. If the event $E^c_1 \cap E^c_2$ is true, all confidence intervals hold. We find a value $T$ which ensures that by $t = T$, the confidence intervals of the bins have separated enough that the algorithm stops. Since one query is made at each time, this value of $T$ is also an upper bound on the query complexity $Q^P_\delta(A_1)$. The derivation of $T$ has been relegated to (Shah et al. 2019, Appendix B), which gives the upper bound stated above.

3.4 Query Complexity Lower Bound

Theorem 3. For any $\delta$-true mode estimator $A$, we have
$$\mathbb{E}[Q^P_\delta(A)] \ge \frac{p_1}{(p_1 - p_2)^2} \log(1/2.4\delta).$$

Proof. Let $x^*(P) = \arg\max_{i \in [k]} p_i$ denote the mode of the distribution $P$. By assumption, we have $x^*(P) = 1$.
Consider any $P'$ such that $x^*(P') \ne x^*(P) = 1$. Let $A$ be a $\delta$-true mode estimator and let $\hat m$ be its output. Then, by definition,
$$P(\hat m = 1) \ge 1 - \delta, \qquad P'(\hat m = 1) \le \delta. \qquad (6)$$
Let $\tau$ be the stopping time associated with estimator $A$. Using Wald's lemma (Wald 1944), we have
$$\mathbb{E}_P\left[\sum_{t=1}^{\tau} \log \frac{p(X_t)}{p'(X_t)}\right] = \mathbb{E}_P[\tau] \cdot D(P\|P'), \qquad (7)$$
where $D(P\|P')$ denotes the Kullback-Leibler (KL) divergence between the distributions $P$ and $P'$. Also,
$$\mathbb{E}_P\left[\sum_{t=1}^{\tau} \log \frac{p(X_t)}{p'(X_t)}\right] = D\big(p(X_1, \ldots, X_\tau)\,\|\,p'(X_1, \ldots, X_\tau)\big) \ge D\big(\mathrm{Ber}(P(\hat m = 1))\,\|\,\mathrm{Ber}(P'(\hat m = 1))\big),$$
where $\mathrm{Ber}(x)$ denotes the Bernoulli distribution with parameter $x \in (0, 1)$ and the last step follows from the data processing inequality (Cover and Thomas 2012). Furthermore,
$$D\big(\mathrm{Ber}(P(\hat m = 1))\,\|\,\mathrm{Ber}(P'(\hat m = 1))\big) \ge \log(1/2.4\delta),$$
which follows from (6) and (Kaufmann, Cappé, and Garivier 2016, Remark 2). Finally, combining with (7) and noting that the argument above works for any $P'$ such that $x^*(P') \ne x^*(P)$, we have
$$\mathbb{E}_P[\tau] \ge \frac{\log(1/2.4\delta)}{\inf_{P' : x^*(P') \ne x^*(P)} D(P\|P')}. \qquad (8)$$
We choose $P'$ as follows, for some small $\epsilon > 0$: $p'_i = p_i$ for $i > 2$, and, with $q = \frac{p_1 + p_2}{2}$,
$$p'_1 = \frac{p_1 + p_2}{2} - \epsilon = q - \epsilon, \qquad p'_2 = \frac{p_1 + p_2}{2} + \epsilon = q + \epsilon.$$
Then
$$D(P\|P') = p_1 \log \frac{p_1}{q - \epsilon} + p_2 \log \frac{p_2}{q + \epsilon} = (p_1 + p_2)\, D\!\left(\mathrm{Ber}\!\left(\frac{p_1}{p_1 + p_2}\right) \Big\|\, \mathrm{Ber}\!\left(\frac{q - \epsilon}{p_1 + p_2}\right)\right)$$
$$\overset{(a)}{\le} (p_1 + p_2)\, \frac{(p_1 - q + \epsilon)^2}{(q - \epsilon)(q + \epsilon)} = (p_1 + p_2)\, \frac{(p_1 - p_2 + 2\epsilon)^2}{(p_1 + p_2 + 2\epsilon)(p_1 + p_2 - 2\epsilon)} \overset{(b)}{\le} \frac{(p_1 - p_2)^2}{p_1}, \qquad (9)$$
where (a) follows from (Popescu et al. 2016, Theorem 1.4), which gives an upper bound on the KL divergence, and (b) follows because $\epsilon$ can be arbitrarily small and $p_1 + p_2 \ge p_1$. Since one query is made at each time, $\tau$ equals the number of queries $Q^P_\delta(A)$. Hence, using (8) and (9), we get the lower bound
$$\mathbb{E}[Q^P_\delta(A)] \ge \frac{p_1}{(p_1 - p_2)^2} \log(1/2.4\delta).$$

4 Mode estimation with QM2

In this section, we present an algorithm for mode estimation under QM2 and analyse its query complexity.
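Before detailing the algorithm, here is a minimal sketch of the binning primitive it builds on: each new sample is compared, via pairwise queries, against one stored representative per existing bin, joining the bin on a +1 answer and opening a new bin otherwise. The function name is ours, and `pair_query(i, j)` stands for the oracle response $O(i, j)$.

```python
def place_sample(t, bins, pair_query):
    """QM2 binning primitive: compare sample index t against one
    representative index of each existing bin; on a +1 answer, join
    that bin, otherwise open a new bin.  Returns the number of
    pairwise queries spent on this sample."""
    queries = 0
    for b in bins:                    # b is a list of sample indices
        queries += 1
        if pair_query(t, b[0]) == +1:
            b.append(t)
            return queries
    bins.append([t])                  # negative answer from every bin
    return queries
```

Note the cost asymmetry this creates: placing one sample costs up to one query per surviving bin, which is exactly why the algorithm below prunes the set of bins it compares against.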
4.1 Algorithm

Recall that under the QM2 query model, a query to the oracle with indices $i$ and $j$ reveals whether the samples $X_i$ and $X_j$ have the same value or not. Our mode estimation scheme is presented in Algorithm 2. During the course of the algorithm we form bins, and the $i$-th bin formed is referred to as bin $i$. Here, each bin is a collection of indices having the same value, as revealed by the oracle. Let $\sigma(i)$ denote the element of the support that the $i$-th bin represents.

The algorithm proceeds in rounds. In round $t$, let $b(t)$ denote the number of bins present. Based on the observations made so far, we form a subset $C(t)$ of the bins. We go over the bins in $C(t)$ one by one, and in each iteration within the round we query the oracle with the index $t$ and a sample index $j_l$ from some bin $l$, i.e., $j_l \in$ bin $l$ for $l \in C(t)$. The round ends when the oracle gives a positive answer to one of the queries, or when we exhaust all bins in $C(t)$. So, the number of queries in round $t$ is at most $|C(t)| \le b(t)$. If we get a positive answer from the oracle, we add $t$ to the corresponding bin; a new bin is created if the oracle gives a negative answer to each of these queries.

We now describe how we construct the subset $C(t)$ of bins to compare against in round $t$. For each bin $i$, corresponding to element $\sigma(i)$, and for each round $t$, we maintain an empirical estimate $\hat p^t_{\sigma(i)}$ of $p_{\sigma(i)}$, as well as a confidence interval radius $\beta^t_{\sigma(i)}$. The choice of $\beta^t_{\sigma(i)}$ is the same as that in QM1, given in (2). $C(t)$ is formed in each round $t$ by retaining only those bins whose UCB is at least the LCB of every other bin:
$$C(t) = \left\{ i \in \{1, 2, \ldots, b(t)\} : \nexists\, l \text{ such that } \hat p^t_{\sigma(l)} - \beta^t_{\sigma(l)} > \hat p^t_{\sigma(i)} + \beta^t_{\sigma(i)} \right\}. \qquad (10)$$
The rest of the algorithm is similar to Algorithm 1, including the stopping rule: we stop when there is a bin whose LCB is greater than the UCB of all other bins in $C(t)$.
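The candidate set of eq. (10) and the stopping rule can be written directly as predicates over per-bin estimates and radii; a sketch with our own function names:

```python
def candidate_set(est, rad):
    """C(t) of eq. (10): a bin stays a candidate unless some bin's LCB
    exceeds its UCB (est, rad: per-bin estimates and radii)."""
    n = len(est)
    return [i for i in range(n)
            if not any(est[l] - rad[l] > est[i] + rad[i] for l in range(n))]

def should_stop(est, rad):
    """Stopping rule: some bin's LCB beats every other bin's UCB."""
    n = len(est)
    return any(all(est[i] - rad[i] > est[j] + rad[j]
                   for j in range(n) if j != i)
               for i in range(n))
```

Note that the leading bin can never eliminate itself (its own LCB never exceeds its own UCB), so it always remains in the candidate set; elimination only ever removes bins that can no longer be the mode.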
As before, the choice of confidence intervals and the stopping rule ensures that when $A_t = (\text{stop}, i)$, the corresponding element satisfies $\sigma(i) = 1$ with probability at least $1 - \delta$.

4.2 Analysis

Theorem 4. For the choice of $\beta^t_i$ given by (2), Algorithm 2 is a $\delta$-true mode estimator.

Algorithm 2: Mode estimation algorithm under QM2
1: $t = 1$
2: $A_0$ = (continue, 1): obtain $X_1$.
3: loop
4: Form $C(t)$ according to (10).
5: flag = 0
6: for all bins $l \in C(t)$ do
7: Obtain $j_l \in$ bin $l$.
8: if $O(t, j_l) == +1$ then
9: Add $t$ to the corresponding bin.
10: flag = 1; BREAK
11: end if
12: end for
13: if flag == 0 then
14: Create a new bin for index $t$.
15: end if
16: Update the empirical estimate $\hat p^t_i$ (1) and confidence interval $\beta^t_i$ (2) for all bins.
17: if $\exists\, i$ s.t. $\hat p^t_i - \beta^t_i > \hat p^t_j + \beta^t_j$ for all $j \in C(t)$, $j \ne i$: $A_t$ = (stop, $i$), exit and output $i$; otherwise $A_t$ = (continue, $t+1$), move to the next round.
18: $t = t + 1$
19: end loop

Proof. The error analysis for Algorithm 2 is similar to that for Algorithm 1, given in Section 3.2. We consider the same two events $E_1$ and $E_2$. Recall that $E_1$ denotes the error event that for some pair $(i, t)$ the confidence bound (3) around $\hat p^t_i$ is violated, and $E_2$ denotes the error event that for some pair $(i, t)$ the confidence bound (4) around the sample variance $V_t(Z^i)$ is violated. Again choosing similar values for $\delta_1$ and $\delta_2$, we get $P[E^c_1 \cap E^c_2] \ge 1 - \delta$. We now need to show that if the event $E^c_1 \cap E^c_2$ is true, then Algorithm 2 returns 1 as the mode; this then implies that Algorithm 2 is a $\delta$-true mode estimator. The analysis remains the same as discussed for QM1. The only additional factor we need to consider here is the event that the bin corresponding to the mode 1 of the underlying distribution is discarded in one of the rounds of the algorithm, i.e., it is not part of $C(t)$ for some round $t$. A bin is discarded when its UCB becomes less than the LCB of some other bin.
Under the event $E^c_1$, all confidence intervals hold, and since $p_1 > p_i$ for all $i \ne 1$, the UCB of the bin corresponding to the mode can never be less than the LCB of any other bin. This implies that under $E^c_1$, the bin corresponding to the mode 1 is never discarded. Hence, the probability of returning 1 as the mode is at least $P[E^c_1 \cap E^c_2]$, which by (5) implies that Algorithm 2 is a $\delta$-true mode estimator.

4.3 Query Complexity Upper Bound

From the analysis of the sample complexity for the QM1 model derived in Section 3.3, we immediately get one upper bound for the QM2 case: Algorithm 1 continues for at most $T$ rounds with probability at least $1 - \delta$, where $T$ is given by Theorem 2. A sample accessed in each of these rounds can be compared to at most $\min\{k, T\}$ other samples. Thus, a natural upper bound on the query complexity of Algorithm 2 is $T \cdot \min\{k, T\}$. The following result provides a tighter upper bound.

Theorem 5. For the $\delta$-true mode estimator $A_2$ corresponding to Algorithm 2, we have the following with probability at least $1 - \delta$:
$$Q^P_\delta(A_2) \le \frac{592}{3} \frac{p_1}{(p_1 - p_2)^2} \log\left(\frac{k}{\delta} \frac{p_1}{(p_1 - p_2)^2}\right) + \sum_{i=3}^{T} \frac{592}{3} \frac{p_1}{(p_1 - p_i)^2} \log\left(\frac{k}{\delta} \frac{p_1}{(p_1 - p_i)^2}\right),$$
for
$$T = \min\left\{k,\ \frac{592}{3} \frac{p_1}{(p_1 - p_2)^2} \log\left(\frac{592\,k}{\delta} \frac{p_1}{(p_1 - p_2)^2}\right)\right\}.$$

Proof. The detailed calculations are provided in (Shah et al. 2019, Appendix B); here we give a sketch. Under the event $E^c_1 \cap E^c_2$, where all confidence bounds hold, any bin represented during the run of the algorithm and corresponding to some element $i \ne 1$ of the support will definitely be excluded from $C(t)$ once its confidence radius satisfies $\beta^t_i < \frac{p_1 - p_i}{4}$ and $\beta^t_1 < \frac{p_1 - p_i}{4}$. We find a time $t$ satisfying the stopping condition for each of the represented bins, and summing over them gives the total query complexity. Following the calculations in (Shah et al. 2019, Appendix B), we get the following value of $t^*_i$ for the bin corresponding to element $i \ne 1$, such that by $t = t^*_i$ the bin will definitely be excluded from $C(t)$:
$$t^*_i = \frac{592}{3} \frac{p_1}{(p_1 - p_i)^2} \log\left(\frac{k}{\delta} \frac{p_1}{(p_1 - p_i)^2}\right).$$
Also, for the first bin we get $t^*_1$ as follows.
$$t^*_1 = \frac{592}{3} \frac{p_1}{(p_1 - p_2)^2} \log\left(\frac{k}{\delta} \frac{p_1}{(p_1 - p_2)^2}\right).$$
Also, the total number of bins created is at most
$$T = \min\left\{k,\ \frac{592}{3} \frac{p_1}{(p_1 - p_2)^2} \log\left(\frac{592\,k}{\delta} \frac{p_1}{(p_1 - p_2)^2}\right)\right\}.$$
A sample from the bin corresponding to element $i$ is involved in at most $t^*_i$ queries. Hence the total number of queries $Q^P_\delta(A_2)$ is bounded as
$$Q^P_\delta(A_2) \le \frac{592}{3} \frac{p_1}{(p_1 - p_2)^2} \log\left(\frac{k}{\delta} \frac{p_1}{(p_1 - p_2)^2}\right) + \sum_{i=3}^{T} \frac{592}{3} \frac{p_1}{(p_1 - p_i)^2} \log\left(\frac{k}{\delta} \frac{p_1}{(p_1 - p_i)^2}\right).$$

4.4 Query Complexity Lower Bound

The following theorem gives a lower bound on the expected query complexity for the QM2 model.

Theorem 6. For any $\delta$-true mode estimator $A$, we have
$$\mathbb{E}[Q^P_\delta(A)] \ge \frac{p_1}{2(p_1 - p_2)^2} \log(1/2.4\delta).$$

Proof. Consider any $\delta$-true mode estimator $A$ under query model QM2, and let $\tau$ denote its expected (pairwise) query complexity when the underlying distribution is $P$. Next, consider the QM1 query model, and let an estimator $A'$ simply simulate the estimator $A$ of QM2, by querying the values of any sample indices involved in the pairwise queries. It is easy to see that since $A$ is a $\delta$-true mode estimator under QM2, the same holds for $A'$ under QM1. Furthermore, the expected query complexity of $A'$ under QM1 is at most $2\tau$, since each pairwise query involves two sample indices. Thus, if the query complexity $\tau$ of $A$ under QM2 were less than $\frac{p_1}{2(p_1 - p_2)^2} \log(1/2.4\delta)$, we would have a $\delta$-true mode estimator under QM1 with query complexity less than $\frac{p_1}{(p_1 - p_2)^2} \log(1/2.4\delta)$, which contradicts the lower bound in Theorem 3. The result then follows.

The lower bound on the query complexity given by the above theorem matches the first term of the upper bound in Theorem 5. So, the lower bound will be close to the upper bound when the first term dominates, in particular when $\frac{p_1}{(p_1 - p_2)^2} \gg \sum_{i=3}^{k} \frac{p_1}{(p_1 - p_i)^2}$. While the above lower bound matches the upper bound in this restricted regime, we would like a sharper lower bound which works more generally.
Towards this goal, we consider a slightly altered (and relaxed) problem setting which relates to the problem of best-arm identification in a multi-armed bandit (MAB) setting, studied in (Kaufmann, Cappé, and Garivier 2016) and (Soare, Lazaric, and Munos 2014). The altered setting and the corresponding result are discussed in (Shah et al. 2019, Appendix C).

5 Experimental Results

For both the QM1 and QM2 models, we simulate Algorithm 1 and Algorithm 2 on various synthetic distributions. We take $k = 5120$ and keep the difference $p_1 - p_2 = 0.208$ constant for each distribution. For the other $p_i$'s we follow two different models:

1. Uniform distribution: the $p_i$'s for $i = 3, \ldots, k$ are chosen equal, so that each $p_i = \frac{1 - p_1 - p_2}{k - 2}$.
2. Geometric distribution: the $p_i$'s are chosen such that $p_2, p_3, \ldots, p_k$ form a decreasing geometric sequence which sums to $1 - p_1$.

For each distribution we run the experiment 50 times and take an average to plot the query complexity. In Fig. 1 we plot the number of queries taken by Algorithm 1 for QM1, for both the geometric and uniform distributions. As suggested by our theoretical results, the query complexity increases (almost) linearly with $p_1$ for a fixed $p_1 - p_2$.

Figure 1: Number of queries for Algorithm 1 under the uniform and geometric distributions.

Figure 2: Number of queries when each sample is queried against all the bins, and number of queries for Algorithm 2, for the uniform and geometric distributions.

In Fig. 2 we plot the number of queries taken by Algorithm 2 for QM2 and compare it to the number of queries taken by a naive algorithm which queries each sample against all the bins formed, for both the uniform and geometric distributions. To further see how Algorithm 2 performs better than this naive algorithm, we plot the number of queries in each round for a particular geometric distribution, in Fig. 3.
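For concreteness, the two synthetic families described above can be generated as follows; the geometric ratio `r` is our illustrative parameter (and `p1` must be large enough that the resulting `p2` stays below it, as in our experiments).

```python
def make_uniform_tail(k, p1, p2):
    """Uniform model of Section 5: p1 and p2 fixed, and the remaining
    mass 1 - p1 - p2 spread evenly over elements 3..k."""
    tail = (1.0 - p1 - p2) / (k - 2)
    return [p1, p2] + [tail] * (k - 2)

def make_geometric_tail(k, p1, r):
    """Geometric model: p2, ..., pk form a decreasing geometric
    sequence with ratio r that sums to 1 - p1.  The ratio r is an
    illustrative parameter; p1 must be large enough that p2 < p1."""
    total = (1.0 - r ** (k - 1)) / (1.0 - r)   # 1 + r + ... + r^(k-2)
    p2 = (1.0 - p1) / total
    return [p1] + [p2 * r ** i for i in range(k - 1)]
```

Both constructors return a valid probability vector whose mode is element 1, so they can be fed directly into either algorithm's sampling loop.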
We observe that over the rounds, bins start getting discarded by Algorithm 2 and hence its number of queries per round decreases, while for the naive algorithm, as more and more bins are formed, the number of queries per round keeps increasing.

Figure 3: Number of queries per round for Algorithm 2 and for the algorithm which queries each sample against all of the bins formed.

Real-world dataset: As mentioned in the introduction, one of the applications of mode estimation is partial clustering. Via experiments on a real-world purchase dataset (Leskovec and Krevl 2014), we benchmark the performance of our proposed Algorithm 2 for pairwise queries, a naive variant of it with no UCB-based bin elimination, and the algorithm of (Mazumdar and Saha 2017a), which performs a full clustering of the nodes. Using (Leskovec and Krevl 2014), we create a dataset where each entry represents an item available on Amazon.com along with a label corresponding to a category that the item belongs to. We take entries with a common label to represent a cluster, and we consider the task of using pairwise oracle queries to identify a cluster of items within the top-10 clusters, from among the top-100 clusters in the entire dataset. Note that while our algorithm is tailored towards finding a heavy cluster, the algorithm in (Mazumdar and Saha 2017a) proceeds by first learning the entire clustering and then identifying a large cluster. Some statistics of the dataset over which we ran our algorithm: number of clusters, 100; number of nodes, 291925; size of the largest cluster, 53551; size of the 11th largest cluster, 19390. With a target confidence of 99%, our proposed algorithm terminates in 631k pairwise queries, while the naive algorithm takes 1474k queries (2.3x more).
The algorithm of (Mazumdar and Saha 2017a) clusters all nodes first, and is thus expected to take around $n \times k = 29192.5$k queries (46x more) to terminate; this was larger than our computational budget and hence could not be run to completion. With a target confidence of 95%, our algorithm takes 558k queries, while the naive version takes 1160k queries (2x more) to terminate.

6 Discussion

There are a few variations of the standard mode estimation problem which have important practical applications, and we discuss them in the following subsections.

6.1 Top-m estimation

A natural extension of the mode estimation problem is estimating the top-$m$ values, i.e., for $p_1 > p_2 > \cdots > p_k$, an algorithm should return the set $\{1, 2, \ldots, m\}$. A possible application is in clustering, where we are interested in finding the $m$ largest clusters. Algorithms 1 and 2 for QM1 and QM2 change only in the stopping rule: the new stopping rule is $A_t = (\text{stop}, \cdot)$ when there exist $m$ bins whose LCBs are greater than the UCBs of all the remaining bins. In this setting, we define a $\delta$-true top-$m$ estimator as an algorithm which returns the set $\{1, 2, \ldots, m\}$ with probability at least $1 - \delta$. In the following results we give bounds on the query complexity $Q^P_\delta(A)$ of a $\delta$-true top-$m$ estimator $A$, for the QM1 query model.

Theorem 7. For the $\delta$-true top-$m$ estimator $A_m$ corresponding to Algorithm 1 for the top-$m$ case, we have the following with probability at least $1 - \delta$:
$$Q^P_\delta(A_m) \le \max_{i \in \{1, \ldots, m\},\ j \in \{m+1, \ldots, k\}} \frac{592}{3} \frac{p_i}{(p_i - p_j)^2} \log\left(\frac{k}{\delta} \frac{p_i}{(p_i - p_j)^2}\right).$$

Proof. The proof follows along the same lines as Theorem 2. Here, for a bin $i \in \{1, \ldots, m\}$ and a bin $j \in \{m+1, \ldots, k\}$, their confidence bounds are separated once $\beta^t_i < \frac{p_i - p_j}{4}$ and $\beta^t_j < \frac{p_i - p_j}{4}$. The calculations then follow as before, giving the above upper bound.

Theorem 8. For any $\delta$-true top-$m$ estimator $A$, we have
$$\mathbb{E}[Q^P_\delta(A)] \ge \max_{i \in \{1, \ldots, m\},\ j \in \{m+1, \ldots, k\}} \frac{p_i}{(p_i - p_j)^2} \log(1/2.4\delta).$$

Proof.
The proof follows along the same lines as Theorem 3. Here, for $\epsilon > 0$, the alternate distribution $P'$ that we choose has $p'_i = \frac{p_i + p_j}{2} - \epsilon$ and $p'_j = \frac{p_i + p_j}{2} + \epsilon$ for some $i \in \{1, \ldots, m\}$ and some $j \in \{m+1, \ldots, k\}$. Taking the maximum over all such pairs $i, j$ gives the lower bound.

6.2 Noisy oracle

Here, we consider a setting where the oracle answers queries noisily, and we analyze the impact of errors on the query complexity of mode estimation. For the noisy QM1 model, we assume that when we query the oracle with some index $j$, the value revealed is the true sample value $X_j$ with probability $1 - p_e$, and each of the $k$ values in the support with probability $\frac{p_e}{k}$. The problem of mode estimation in this noisy setting is equivalent to that in a noiseless setting where the underlying distribution is given by
$$p'_i = (1 - p_e)\, p_i + \frac{p_e}{k}.$$
Since the mode of this altered distribution is the same as that of the true distribution, we can use Algorithm 1 for mode estimation under the noisy QM1 model, and the corresponding query complexity bounds in Section 3 hold true. For the noisy QM2 model, we assume that for any pair of indices $i$ and $j$ sent to the oracle, it returns the correct answer ($+1$ if $X_i = X_j$, $-1$ otherwise) with probability $1 - p_e$. This setting is technically more involved and is in fact similar to clustering using pairwise noisy queries (Mazumdar and Saha 2017a). The mode estimation problem in this setting corresponds to identifying the largest cluster, which has been studied in (Choudhury, Shah, and Karamchandani 2019). Deriving tight bounds for this case remains open.

References

Bubeck, S.; Cesa-Bianchi, N.; et al. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1-122.
Chernoff, H. 1964. Estimation of the mode. Annals of the Institute of Statistical Mathematics 16(1):31-41.
Choudhury, T.; Shah, D.; and Karamchandani, N. 2019. Top-m clustering with a noisy oracle.
In 2019 National Conference on Communications (NCC).
Cover, T. M., and Thomas, J. A. 2012. Elements of Information Theory. John Wiley & Sons.
Karp, R. M.; Shenker, S.; and Papadimitriou, C. H. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28(1):51-55.
Kaufmann, E.; Cappé, O.; and Garivier, A. 2016. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research 17(1):1-42.
Leskovec, J., and Krevl, A. 2014. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data/com-Amazon.html.
Manku, G. S., and Motwani, R. 2002. Approximate frequency counts over data streams. In VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, 346-357. Elsevier.
Maurer, A., and Pontil, M. 2009. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740.
Mazumdar, A., and Saha, B. 2016. Clustering via crowdsourcing. arXiv preprint arXiv:1604.01839.
Mazumdar, A., and Saha, B. 2017a. Clustering with noisy queries. In Advances in Neural Information Processing Systems, 5788-5799.
Mazumdar, A., and Saha, B. 2017b. Query complexity of clustering with side information. In Advances in Neural Information Processing Systems, 4682-4693.
Mazumdar, A., and Saha, B. 2017c. A theoretical analysis of first heuristics of crowdsourced entity resolution. In Thirty-First AAAI Conference on Artificial Intelligence.
Misra, J., and Gries, D. 1982. Finding repeated elements. Science of Computer Programming 2(2):143-152.
Parzen, E. 1962. On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3):1065-1076.
Popescu, P. G.; Dragomir, S.; Slusanschi, E. I.; and Stanasila, O. N. 2016. Bounds for Kullback-Leibler divergence. Electronic Journal of Differential Equations 2016.
Shah, D.; Choudhury, T.; Karamchandani, N.; and Gopalan, A. 2019.
Sequential mode estimation with oracle queries.
Soare, M.; Lazaric, A.; and Munos, R. 2014. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, 828-836.
Wald, A. 1944. On cumulative sums of random variables. The Annals of Mathematical Statistics 15(3):283-296.